ABSTRACT


Under an idealized model of parallel computation, parallel graph
theoretical algorithms are studied using adjacency matrices as the
representation of graphs. Two optimal algorithms for computing
A(n) = a(1) ⊕ a(2) ⊕ ... ⊕ a(n), where ⊕ is any associative binary
operation, are presented; their time bounds are proven to be equal to
the theoretical lower time bounds in both bounded and unbounded
parallelism. The technique for processor optimization is demonstrated
through an example graph problem: finding the connected components of
an n-node undirected graph. Efficient parallel graph algorithms are
developed for detecting the existence of negative cycles, finding the
strongly connected components, and verifying unilateral connectivity
and acyclicity of an n-node directed graph. Each algorithm runs in
O((log n) n^2.81 / P) time with P ≤ n^2.81/log n processors.

CHAPTER 1


INTRODUCTION 

Because  of  the  generality  in  their  mathematical 
structure,  graphs  have  proven  to  be  a  useful  abstraction 
wherever  discrete  objects  and  binary  relations  are  dealt 
with.  They  have  aided  analysis  in  biology,  chemistry, 
computing  science,  electrical  and  civil  engineering, 
operations  research,  linguistics,  sociology  and  many  more. 
These  uses  of  graphs  require  many  graph  theoretical 
algorithms.  Consequently,  since  the  early  sixties,  a  great 
deal  of  effort  has  been  directed  to  producing  and  analysing 
graph  theoretical  algorithms.  However,  the  graphs  resulting 
from  many  real-life  problems  are  often  so  large  that  the 
worst  case  complexity  analysis  is  extremely  pessimistic. 
This  necessitates  the  development  of  more  efficient  and 
economical  algorithms. 

To  tackle  this  problem,  three  approaches  have  commonly 
been  used  in  the  literature.  The  first  technique  is  to 
search  for  better  algorithms  with  respect  to  time.  The 
second  method  is  to  improve  the  circuit  switching  speeds 
technologically.  The  third  way  is  via  parallel  processing. 
In particular, parallel computers capable of performing several
independent operations concurrently have been designed; they trade
hardware for computation time.

Tremendous progress has been made along the first two lines over the
past two decades. But optimal algorithms cannot be improved further
under the first approach, yet may still be unwieldy for large graphs.
And as improvements in switching devices and miniaturization rapidly
reach their physical limits, it is evident that any significant
improvement in processing speed must be obtained by the concurrent
processing of a number of operations. For the third approach, various
multiple-processor systems have been proposed and constructed. As a
result of the recent revolution in microprocessors and steadily
dropping hardware prices, large-scale parallel computers with as many
as 2^14 to 2^16 processors have become feasible [33,40].

Since  a  parallel  algorithm  may  be  obtained  by 
recognizing  the  inherent  parallelism  of  a  sequential 
algorithm,  many  people  think  that  parallel  computations  are 
mere  extensions  of  sequential  computations.  This  intuition 
is  not  always  correct.  Adopting  an  efficient  sequential 
algorithm,  such  as  depth-first  search,  on  a  parallel 
computer  is  clearly  far  from  optimal.  Conversely  an 
inefficient  sequential  algorithm,  such  as  finding  transitive 
closure  by  matrix  multiplication,  can  lead  to  an  efficient 
parallel  algorithm.  Moreover,  to  recognize  the  inherent 
parallelism  of  many  sequential  algorithms  is  not 
straightforward. Therefore designing efficient parallel algorithms for
parallel computers is a significant and important problem. However, it
is only recently that more attention has been paid to the development
of parallel graph theoretical algorithms [2, 12, 13, 14, 19, 21, 28, 37].

In this thesis, we concentrate our study on graph theoretical
algorithms with polynomial complexities (i.e., time, processor and
storage complexities are bounded by a polynomial in the size of the
problem). The basic idea is to identify the most important parts of an
algorithm (i.e., the most time or processor consuming), and put most
effort into
optimizing  those  parts.  Based  on  this  idea,  the  processor 
bounds  on  most  of  the  existing  parallel  graph  algorithms 
(see  Table  1)  are  improved  and  in  some  cases,  we  show  that 
the  processing  power  is  optimally  utilized.  Also  new 
parallel  algorithms  for  several  graph  problems  are 
formulated. 

To  prepare  for  the  discussions  in  the  following 
chapters,  the  background  material  is  described  in  this 
chapter. In Section 1.1, the classification of parallelism, the
theoretical model of parallel computation and measurements of parallel
complexity are presented. Definitions and representation of graphs are
given in Section 1.2. Previous results are briefly summarized in
Section 1.3.

1.1 Model and Measurements of Parallel Computation

Existing parallel computers, such as the matrix computer ILLIAC IV,
the pipeline computer CDC STAR 100, the associative computer Goodyear
STARAN, and the multiple-processor system UNIVAC 1110, differ widely in
their architectures and characteristics. A special issue of ACM
Computing Surveys [15] provided an excellent survey of these parallel
computers. Parallel computers have been categorized in many ways
[5,16,42]. Broadly, they can be classified by the following criteria:

1.  General  or  special-purpose  system. 

2.  Synchronous  or  asynchronous  executions  of  the  operations. 

3. Bounded or unbounded parallelism. The former, also called
K-parallelism or K-computation, refers to having a fixed number K of
processors available, while the latter has an infinite number of
processors available.

4. Single or multiple instruction streams. In the first case, all
processors either execute or ignore the current instruction broadcast
by the control unit, though using different data and depending on a
local on/off switch. This is called Single Instruction Stream-Multiple
Data Stream (SIMD). In the second case, processors may perform
different instructions, yielding the Multiple Instruction
Stream-Multiple Data Stream (MIMD). The terms SIMD and MIMD are due to
Flynn [16].

From the classification, one can easily observe that parallelism has
been defined on various levels ranging from
the  bit  to  the  system  level.  At  a  lower  level,  it  may  refer 
to  arithmetic  operations  on  the  whole  word  rather  than  on 
one  bit  at  a  time.  At  a  higher  level,  called  the  algorithm 
level,  it  may  refer  to  parallel  executions  of  independent 
statements  composing  the  program.  Finally,  at  the  highest 
level,  it  may  refer  to  concurrent  processes  of  two  or  more 
conceptually distinct and independent programs. In what follows, our
attention is limited to the algorithm level.

Moreover, the structure and performance of an algorithm depend not
only on the problem at hand but also on the advantages and limitations
of the computer. In order to remove these restrictions and exploit the
inherent parallelism, a more flexible and more powerful theoretical
model is essential; it is discussed in the following section.


1.1.1 Model of Parallel Computation

A number of models of parallel computation have been proposed by
various authors [2, 11, 13, 19, 20, 27]. The idealized model used in
this thesis is similar to Csanky's model [13], which has been widely
used, and which satisfies the following assumptions:

(1) An arbitrary number (generally bounded by a polynomial in the size
of the problem) of identical processors and a sufficiently large memory
accessible by each processor are available.

(2)  Instructions  are  always  available  for  execution  as 
required  and  are  never  held  up  by  the  central  control 
unit.  Processors  are  synchronized  and  all  instructions 
executed  in  parallel  are  identical  (SIMD). 

(3) Initially, the input data is stored in the memory. At any time, two
processors may read from but must not write into the same memory
location simultaneously.

(4) Each processor is capable of performing any one of the arithmetic,
Boolean and comparison operations. At any time, each processor may
fetch its operands from the memory, perform an operation and store the
result in the memory in one step called a time unit.

(5)  No  memory  or  data  alignment  time  penalties  are 
incurred. 

In  reality,  all  parallel  algorithms  must  deal  with  the 
complex  problems  of  data  manipulation,  storage  allocation, 
memory  interference  and  interprocessor  communication.  An 
ideal  interconnection  network  for  communications  among  the  n 
processors  and  memory  should  directly  link  each  processor  to 
every  other  processor  and  memory,  but  this  is  far  too  costly 
for  large  n.  To  remedy  this  problem,  memory  is  partitioned 
into  units  and  different  restricted  networks  have  been 
proposed  [38]  which  cause  possibly  significant  communication 
delays. For instance, Gentleman [18] pointed out that in the
two-dimensional rectangular grid network, communication delay is the
limiting factor for matrix manipulations. Nevertheless, in some
problems, these delays can be minimized by careful redistribution of
data and reindexing of processors, without significantly affecting the
computation time [1, 6, 25, 41].

The  unbounded  parallel  algorithms  presented  in  the 
following  chapters  can  be  transformed  to  produce  the  time 
bounds  on  the  corresponding  bounded  parallel  algorithms. 
This  is  demonstrated  through  an  example  in  Section  3. 1. 3, 
which  is  to  find  the  connected  components  of  an  undirected 
graph,  where  both  bounded  and  unbounded  complexities  are 
given. Furthermore, it is quite easy to extend the
algorithms  presented  here  to  MIMD  systems  and  the  processor 
requirement  may  consequently  be  reduced.  Thus,  we  feel  that 
algorithms  developed  under  such  an  idealized  model  will  be 
influential  in  creating  algorithms  for  realizable  parallel 
computers. 


1.1.2 Measurements of Parallel Complexity

The  parallel  time  complexity  of  the  computation  is  the 
least  number  of  time  units  necessary  to  produce  the  result 
rather  than  the  total  number  of  operations  performed  by  all 
processors.  Therefore  the  processor  requirement  is  regarded 
as an important parameter. It is of considerable practical interest to
evaluate the effectiveness of parallelism. In order to measure the
performance of a parallel algorithm, the size of the problem (e.g., the
number of nodes and edges of the graph) has been chosen as a parameter
to determine the time, processor and storage requirements.

Let T(P) be the number of time units required by a parallel algorithm
using P ≥ 1 processors. The speedup of the P-processor computation over
the corresponding uniprocessor computation is defined as
S(P) = T(1)/T(P) ≥ 1, and the efficiency, which indicates the
utilization of the processing power, is defined as E(P) = S(P)/P ≤ 1.
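The two definitions above can be evaluated directly. The following is a minimal sketch; the function names and the sample figures (1023 sequential steps against 10 parallel steps on 512 processors) are illustrative assumptions, not measurements from this thesis.

```python
def speedup(t1: int, tp: int) -> float:
    """S(P) = T(1) / T(P): uniprocessor time over P-processor time."""
    return t1 / tp


def efficiency(t1: int, tp: int, p: int) -> float:
    """E(P) = S(P) / P: fraction of the processing power actually used."""
    return speedup(t1, tp) / p


# Hypothetical figures: combining 1024 operands with an associative
# operation takes 1023 steps on one processor and 10 steps on 512.
s = speedup(1023, 10)
e = efficiency(1023, 10, 512)
```

In this hypothetical case the speedup is large but the efficiency is only about 0.2, which is the kind of gap that motivates reducing the processor requirement.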

Our first objective is the construction of efficient parallel graph
algorithms ideally exhibiting linear (in P) speedup. This situation is
realized only when all P processors are loaded in each time unit.
However, linear speedup is seldom achievable. In practice, a "fast"
speedup is S(P) = O(P/log P), which is acceptable although less than
linear. For those existing parallel algorithms with nonlinear speedup,
our second objective and the main achievement of this thesis is the
minimization of the processor requirement without increasing the time
complexity by more than a constant factor.

1.2 Definitions and Representation of Graphs

1.2.1 Definitions

An undirected graph G = (V,E) consists of a finite, non-empty set V of
n elements called nodes and a set E of m unordered pairs of nodes
called edges (v,w). G1 = (V1,E1) is a subgraph of G if V1 ⊆ V and
E1 ⊆ E. A directed graph D = (V',E') is defined similarly, except that
edges are ordered pairs of nodes; v is called the tail and w the head
of the edge (v,w). If (v,w) ∈ E, nodes v and w are adjacent and edge
(v,w) is incident on nodes v and w; whereas if (v,w) ∈ E', v is said to
be adjacent to w while w is adjacent from v, and the directed edge
(v,w) is incident from v and to w. The adjacency matrix of a graph is
an n×n matrix A of 0's and 1's, where the (i,j)-th element, A(i,j), is
1 if and only if there is an edge from node i to node j. The weight
matrix of a weighted graph is an n×n matrix W such that
W(i,j) = w(i,j) if there is an edge from node i to node j, and
W(i,j) = c otherwise, where w(i,j) is the weight of the edge from node
i to node j, and c is usually 0 or ∞, depending on the interpretation
of the weight and the problem to be solved. The transitive closure A^+
of an n×n adjacency matrix A is defined as A + A^2 + ... and the
reflexive transitive closure of A is defined as I + A^+ where I is the
identity matrix, i.e., I(i,i) = 1 and I(i,j) = 0 for i ≠ j and
1 ≤ i, j ≤ n.
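As an illustration of these definitions, the transitive closure can be obtained by repeatedly squaring the Boolean matrix I + A. The sketch below is sequential Python written only to make the definitions concrete; the function names are assumptions, and it is not one of the parallel algorithms developed in this thesis.

```python
def bool_mult(x, y):
    """Boolean matrix product: (x*y)[i][j] = OR over k of x[i][k] AND y[k][j]."""
    n = len(x)
    return [[any(x[i][k] and y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def reflexive_transitive_closure(a):
    """Compute I + A + A^2 + ... by squaring (I + A) ceil(log2 n) times."""
    n = len(a)
    r = [[a[i][j] or i == j for j in range(n)] for i in range(n)]  # I + A
    for _ in range((n - 1).bit_length()):  # (n - 1).bit_length() == ceil(log2 n)
        r = bool_mult(r, r)
    return r


def transitive_closure(a):
    """A + A^2 + ... = A * (I + A + A^2 + ...)."""
    return bool_mult(a, reflexive_transitive_closure(a))


# Directed path 0 -> 1 -> 2 (nodes are numbered from 0 in this sketch).
path = [[False, True, False],
        [False, False, True],
        [False, False, False]]
closure = transitive_closure(path)
```

Each squaring doubles the path length covered, so about log n Boolean matrix products suffice; this is why matrix multiplication matters for the time bounds discussed later.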

An edge is called a self-loop if it begins and ends at the same node.
Two edges are said to be parallel if they have the same pair of end
nodes (and, in the case of a directed graph, the same direction). A
graph is simple if it has neither self-loops nor parallel edges. Here
and hereafter, the graphs under consideration are simple, and the nodes
in an n-node graph are labelled with the integers 1 through n, i.e.,
V = {1,2,...,n}.

In a graph, a path of length n - 1 with endpoints v and w is a sequence
of nodes v = y(1), y(2), ..., y(n) = w such that (y(i-1), y(i)) is an
edge for 1 < i ≤ n. A path is simple if y(1), y(2), ..., y(n) are
distinct nodes. A cycle is a simple path from v to v containing at
least two edges for a directed graph and three edges for an undirected
graph. A graph which contains no cycle is called acyclic.

G is connected if for every pair of distinct nodes v,w ∈ V, there is a
path from v to w (i.e., w is said to be reachable from v) and from w to
v. A connected component of G is a maximal connected subgraph of G,
i.e., it is not a subgraph of any other connected subgraph of G. D is
strongly
connected  if  every  two  nodes  are  mutually  reachable;  it  is 
unilaterally  connected  if  for  any  two  nodes,  at  least  one  is 
reachable  from  the  other;  and  it  is  weakly  connected  if 
every  two  nodes  are  joined  by  a  path  in  which  the  direction 
of  each  edge  is  ignored.  A  strongly  connected  component  of  D 
is  a  maximal  strongly  connected  subgraph;  a  unilaterally 
connected  component  is  a  maximal  unilaterally  connected 
subgraph;  and  a  weakly  connected  component  is  a  maximal 
weakly connected subgraph. Thus, D is strongly, unilaterally or weakly
connected if and only if it has exactly one corresponding connected
component. D is disconnected if it is not even weakly connected.
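The three kinds of connectivity for a directed graph can be told apart from its reachability relation. The sketch below uses a sequential Warshall-style closure purely for illustration; the function names and the 0-based node numbering are assumptions, and the parallel algorithms of later chapters proceed differently.

```python
def reachability(adj):
    """reach[i][j] is True iff j is reachable from i (every node reaches itself)."""
    n = len(adj)
    reach = [[bool(adj[i][j]) or i == j for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    return reach


def classify(adj):
    """Return the strongest connectivity class that the directed graph satisfies."""
    nodes = range(len(adj))
    r = reachability(adj)
    if all(r[i][j] and r[j][i] for i in nodes for j in nodes):
        return "strong"
    if all(r[i][j] or r[j][i] for i in nodes for j in nodes):
        return "unilateral"
    # Ignore edge directions and test ordinary connectivity.
    undirected = [[adj[i][j] or adj[j][i] for j in nodes] for i in nodes]
    u = reachability(undirected)
    if all(u[i][j] for i in nodes for j in nodes):
        return "weak"
    return "disconnected"
```

For example, the directed path 0 -> 1 -> 2 is unilaterally but not strongly connected, while the graph with edges 0 -> 1 and 2 -> 1 is only weakly connected.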

An  undirected  acyclic  graph  is  called  a  forest.  A 
connected  forest  is  called  a  tree  which  has  a  distinguished 
node,  called  a  root  such  that  every  node  in  the  tree  is 
reachable from the root. A terminally (or initially) rooted
tree is a connected directed acyclic graph such that the
root  has  no  leaving  (or  entering)  edges  and  other  nodes  have 
exactly  one  leaving  (or  entering)  edge.  A  spanning  tree  of  G 
is  an  undirected  tree  that  connects  all  nodes  in  V.  In  a 
forest,  if  there  is  a  path  from  v  to  w,  then  v  is  an 
ancestor  of  w  and  w  is  a  descendant  of  v.  Furthermore,  if 
(v,w) ∈ E', then v is called an immediate ancestor of w and
w  is  an  immediate  descendant  of  v. 

Throughout this thesis, unless specified otherwise, ⌈x⌉ denotes the
smallest integer ≥ x (ceiling), ⌊x⌋ denotes the greatest integer ≤ x
(floor), |V| denotes the cardinality (or size) of any set V and log n
denotes ⌈log2 n⌉.

1.2.2  Representation  of  Graphs 

In  a  computer,  the  graph  must  be  represented  in  a 
discrete  way.  A  variety  of  data  structures  have  been 
developed  for  this  with  respect  to  conventional  sequential 
computers. The convenience of implementation, as well as the
efficiency of a graph algorithm, depends on the proper selection of the
representation of the graph. Among all the known data structures for
representing a graph G = (V,E) where |V| = n and |E| = m, the adjacency
matrix is particularly appropriate for unbounded parallel computation.
This is mainly because its natural structure provides a large amount of
parallelism in matrix manipulations, so the fundamental operations on
sets can be done efficiently by simply taking advantage of the matrix
indices. For example, INSERT edge, DELETE edge, UPDATE, UNION, FIND and
NUMBER can be done in one time unit, while MINIMUM (finding the minimum
element of the set) can be done in O(log n) time units, which is indeed
optimal. Moreover, initialization and reorganization of an adjacency
matrix require O(1) and O(log n) time units respectively. Furthermore,
the adjacency matrix is relatively easy to implement and can be
converted to other representations of the graph, such as adjacency
lists or trees, in O(log n) time units. In contrast, converting
adjacency lists to an adjacency matrix takes linear time.

The only weakness of the adjacency matrix representation is its storage
requirement, which is always proportional to the square of the number
of nodes in the graph regardless of the number of edges. For dense
graphs, |E| = O(n^2), the difference in storage requirement between the
adjacency matrix and other graph representations is insignificant. For
sparse graphs, |E| = O(n), the difference increases to a factor of
O(n).

1.3 Previous Results

With the advent of parallel computers, most research on developing and
analysing parallel algorithms has concentrated on numerical
applications. Hence, several excellent surveys of numerical parallel
algorithms have appeared. Miranker [29] summarized the early work of
the late sixties; more recently, Heller [20] presented a more complete
and up-to-date collection of parallel algorithms in numerical linear
algebra; Ortega and Voigt [32] gave a detailed account of algorithms
for solving differential equations on vector computers; and Sameh and
Kuck [36] described direct parallel algorithms for solving systems of
linear equations in greater depth.

Alongside the growth in the development of parallel algorithms,
nevertheless, graph problems have received little attention. The best
known upper time and processor bounds of the existing parallel graph
algorithms, which are bounded by a polynomial, are displayed in
Table 1. Since matrix multiplication and sorting are useful tools in
formulating parallel graph algorithms, their current best upper time
and processor bounds are also included in Table 1.

| Problem                                              | Time       | Processors     | Ref. |
|------------------------------------------------------|------------|----------------|------|
| Sort n Elements                                      | O(log n)   | n log n        | [34] |
| Find the Minimum Element of a Set of n Elements      | O(log n)   | ⌈n/log n⌉      | [37] |
| Multiply 2 n×n Matrices                              | O(log n)   | n^2.81/log n   | [10] |
| Find the Transitive Closure of an Undirected Graph G | O(log^2 n) | n⌈n/log n⌉     | [37] |
| Find the Transitive Closure of a Directed Graph D    | O(log^2 n) | n^2.81/log n   | [10] |
| Find All-Pairs Shortest Path                         | O(log^2 n) | n^2⌈n/log n⌉   | [37] |
| Find an Absolute Sink                                | O(log n)   | n^2            | [19] |
| Verify Bipartite Graph                               | O(log^2 n) | n^3            | [19] |
| Find a Spanning Tree of G                            | O(log^2 n) | n⌈n/log n⌉     | [37] |
| Find the Minimum Spanning Trees of G                 | O(log^2 n) | n⌈n/log n⌉     | [37] |
| Find the Connected Components of G                   | O(log^2 n) | n⌈n/log n⌉     | [37] |
| Find the Biconnected Components of G                 | O(log^2 n) | n^2⌈n/log n⌉   | [37] |
| Find the Bridge Connected Components of G            | O(log^2 n) | n^2⌈n/log n⌉   | [37] |
| Find the Weakly Connected Components of D            | O(log^2 n) | n^3            | [2]  |
| Find the Strongly Connected Components of D          | O(log^2 n) | n^3            | [2]  |
| Find the Dominators of D                             | O(log^2 n) | (log n) n^3.81 | [37] |
| Find a Cycle of G                                    | O(log^2 n) | n^2            | [37] |
| Find a Cycle Basis of G                              | O(log^2 n) | n^3            | [37] |
| Find the Shortest Cycle of D                         | O(log^2 n) | n^2⌈n/log n⌉   | [37] |

Table 1: Summary of previous results on graph problems

CHAPTER  2 


LOWER  BOUNDS  AND  BASIC  ALGORITHMS 

2.1 Lower Bounds

As an immediate consequence of our model of unbounded parallel
computation, the maximum parallelism is achievable when the steps of a
computation are completely independent, i.e., all operands are
available at the same time. Since each processor uses only unary or
binary operations, a simple lower bound on the computation time can be
derived as follows.


Lemma  2.1:  Computing  a  function  in  n  steps  on  a  single 
processor  can  be  accomplished  in  one  step  on  n  processors  if 
and  only  if  the  operands  of  each  step  are  independent. 

Proof  :  The  proof  is  trivial.  □ 

The independence of operands, however, may not always exist. One
example is the computation of A(n) = a(1) ⊕ a(2) ⊕ ... ⊕ a(n), where ⊕
is any associative binary operation. This problem is the simplest
instance of combining n inputs into one output. A(n) can be computed in
log n time units with ⌊n/2⌋ processors by using recursive doubling,
i.e., repeatedly separating each computation into two independent parts
of equal size which are then computed in parallel. This method is
actually a parallel version of divide-and-conquer which can be
represented by a binary tree, called a binary computation tree, in
which the root corresponds to the result, the terminal nodes correspond
to the input operands, the internal nodes correspond to the operations
and the height (depth) of the tree corresponds to the number of
parallel operations performed.
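Recursive doubling can be simulated level by level on the binary computation tree. The sketch below is a sequential simulation under the assumption that all pairs within one level would be combined simultaneously; the function name is illustrative and the input is assumed to be non-empty.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def recursive_doubling(xs: Sequence[T], op: Callable[[T, T], T]) -> Tuple[T, int]:
    """Combine xs with an associative op; return (result, parallel steps used)."""
    level = list(xs)
    steps = 0
    while len(level) > 1:
        # One parallel step: every adjacent pair is combined "at once",
        # using floor(len(level) / 2) processors.
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # an unpaired operand moves up unchanged
        level = nxt
        steps += 1
    return level[0], steps


# String concatenation is associative but not commutative.
result, steps = recursive_doubling(list("abcdefg"), lambda x, y: x + y)
```

Because only adjacent operands are paired, the left-to-right order is preserved, so associativity alone is enough; seven operands need ⌈log2 7⌉ = 3 parallel steps.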

The following facts are immediate from a binary tree.

Lemma 2.2: A binary tree with depth i - 1 has at most 2^i - 1 nodes.

Lemma 2.3: A binary tree with n nodes has depth at least log n.

These  facts  lead  to  the  following  well-known  lower 
bound  for  computing  A(n)  with  unbounded  parallelism  which 
has  been  referred  to  as  the  fan-in  argument.  It  was 
generalized to the case with K processors available by Munro and
Paterson [30].

Lemma 2.4 (fan-in argument): At least ⌈log n⌉ parallel operations are
required to compute a result which depends on n arguments.

Theorem 2.5: Suppose the computation of a single element Q requires
q ≥ 1 binary arithmetic operations. Then the shortest computation of Q
with K processors takes at least ⌈(q + 1 - 2^i)/K⌉ + i time units if
q ≥ 2^i and log(q+1) otherwise, where i = log K.
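To get a feel for the bound, it can be evaluated for small cases. The sketch below reads Theorem 2.5 as ⌈(q + 1 - 2^i)/K⌉ + i when q ≥ 2^i and ⌈log2(q + 1)⌉ otherwise, with i = ⌈log2 K⌉; that reading and the function names are assumptions made for illustration.

```python
def clog2(x: int) -> int:
    """ceil(log2 x) for an integer x >= 1."""
    return (x - 1).bit_length()


def lower_bound(q: int, k: int) -> int:
    """Least number of time units for q binary operations on k processors."""
    i = clog2(k)
    if q >= 2 ** i:
        return -(-(q + 1 - 2 ** i) // k) + i  # ceiling division, then add i
    return clog2(q + 1)


# Combining 8 operands needs q = 7 operations.
bound_4_procs = lower_bound(7, 4)
bound_1_proc = lower_bound(7, 1)
```

With one processor the bound degenerates to q, the sequential operation count; with enough processors it degenerates to the fan-in bound of Lemma 2.4.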

Despite its simplicity, Theorem 2.5 is very important because it allows
the translation of complexity theorems from sequential computations to
parallel computations. Based on the fan-in argument, Arjomandi [2]
showed that the lower time bound on verifying connectivity in an n-node
undirected graph G is log(n-4)-1, while the lower time bounds on
verifying strong, unilateral and weak connectivity in an n-node
directed graph D are log(n-3)-2, log(n-5)-1 and log(n-4)-1
respectively. Also, Savage [37, pp. 115-120] proved that the lower time
bounds on determining biconnectivity, bridge connectivity, minimum
spanning trees and a cycle in G, and a dominator and a cycle in D are
all 2 log n to within a constant difference.

Intuitively, in order to determine various graph properties, in the
worst case, all entries in the corresponding adjacency matrix of the
graph (i.e., all edges in the graph) have to be examined. For many
classes of graph properties, various researchers, namely Best et al.
[8], Holt and Reingold [22], Kirkpatrick [24], and Rivest and Vuillemin
[35], have shown sequential lower time bounds of at least O(n^2).
Hence, the parallel lower time bound of log n^2 = 2 log n follows
immediately from the fan-in argument, and it gives a good estimate.

2.2 Basic Algorithms

Computation of A(n) = a(1) ⊕ a(2) ⊕ ... ⊕ a(n), where ⊕ is any
associative binary operation, and matrix multiplication have been the
bottlenecks in solving most of the graph problems. Therefore reductions
of time and processor requirements on these two bottlenecks will
substantially reduce the overall time and processor requirements. In
Section 2.2.1, we present two optimal algorithms for computing A(n). In
Section 2.2.2, previous results on matrix multiplication and its
applications are discussed. Moreover,
we  will  point  out  that  the  processor  bound  of  many  existing 
graph  algorithms,  which  are  based  on  matrix  multiplication, 
can  be  reduced  simply  by  using  the  current  best  matrix 
multiplication  algorithm. 


2.2.1 Computation of A(n)

In the preceding section, we have shown that A(n) is computable in
log n time units with at most ⌊n/2⌋ processors using recursive
doubling. By inspection, the maximum number of processors is needed
only at the first level of the binary computation tree. In the
subsequent levels, half of the processors working at any level are idle
at the next level. As a result, at the last level, only one processor
is working. Clearly, if ⌊n/4⌋ processors are available, the computation
time of A(n) increases by only one time unit (i.e., the first level
takes 2 time units instead of one time unit). This indicates that it is
not always desirable to reduce the computation time to an absolute
minimum. Having in mind our second objective, namely processor
optimization, it is natural to ask:

(A) Can the processors be utilized more wisely, so that the processor
requirement is minimized without significantly affecting the O(log n)
time bound?

In practice, a related question, which has received much attention, is
the following:

(B) Given P processors, at most how many time units are needed to
compute A(n)?

An answer to (A) can be obtained by substituting different values of P
into (B). Thus, we concentrate our attention on tackling (B).

For a problem of size n, suppose there exists an algorithm using t time
units, P processors and q operations. In order to transform the
existing algorithm into one that requires a smaller number K of
processors, two principles, namely algorithm decomposition and problem
decomposition, due to Hyafil and Kung [23], are introduced. The idea of
algorithm decomposition is to decompose each step i of the existing
algorithm into ⌈q(i)/K⌉ substeps so that each of them can be done in
one time unit with K processors, where q(i) is the number of operations
in step i. Hence, the total time T is bounded as follows [9]:

\[
T = \sum_{i=1}^{t} \lceil q(i)/K \rceil \le t + (q - t)/K,
\qquad \text{where } \sum_{i=1}^{t} q(i) = q.
\]
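The algorithm decomposition bound can be checked numerically. In the sketch below, the per-step operation counts are a hypothetical example (recursive doubling on 16 operands) and the function name is an assumption.

```python
def decomposed_time(ops_per_step, k):
    """Total time when step i, with q(i) operations, is split into ceil(q(i)/k) substeps."""
    return sum(-(-q_i // k) for q_i in ops_per_step)  # -(-a // b) is ceil(a / b)


# Hypothetical algorithm: recursive doubling on 16 operands performs
# 8, 4, 2 and 1 operations in its four parallel steps.
ops = [8, 4, 2, 1]
t, q, k = len(ops), sum(ops), 3
total = decomposed_time(ops, k)
bound = t + (q - t) / k
```

Here three processors need 3 + 2 + 1 + 1 = 7 time units, within the bound t + (q - t)/K = 4 + 11/3. The bound holds because each ceiling exceeds q(i)/K by at most (K - 1)/K.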


The idea of problem decomposition is to partition the problem into
subproblems so that each of them can be solved by the existing
algorithm with K processors. Based on this idea, Savage [37, p. 17]
developed an algorithm for finding the minimum of n elements in less
than 2 log n time with ⌈n/log n⌉ processors, which is, in fact, the
best answer to (A).

To answer (B), we construct two different algorithms
below which basically simulate the binary computation tree
based on the above stated principles. The first, named
Algorithm A(n)1, incorporates algorithm decomposition with a
circular queue which is used to store the operands. It is a
variant of Heller's associative fan-in algorithm [20].
Assume that input operands are stored in a circular queue of
size n and K processors are available.

Algorithm A(n)1

1. while more than one operand in the circular queue do

2. Fetch two operands from the circular queue to each
processor and fill up as many processors as possible,
execute the specified ⊗ operation, and then store the
results to the end of the circular queue.

Under the assumption of our model of parallel
computation, step 2 can be done in one time unit. The total
time requirement T for computing A(n) is equal to the number
of iterations of step 2. At each iteration, K operands are
reduced. After ⌊n/K⌋ - 1 iterations, K⌊n/K⌋ - K operands are
reduced and only n + K - K⌊n/K⌋ = K + r operands remain,
where r = n - K⌊n/K⌋; these can be reduced in another
log(K + r) time units to produce the final result. Thus,

   If K < ⌊n/2⌋, T = ⌊n/K⌋ - 1 + log(K + n - K⌊n/K⌋)

   If K ≥ ⌊n/2⌋, T = log n
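
The queue-based fan-in can be simulated sequentially to check the time count. The Python sketch below is an illustration only: the names `fan_in_queue` and `predicted_time` are hypothetical, "log" is read as ⌈log2⌉, and the operator is assumed to be commutative as well as associative (for example Min or +), because results appended to the end of the queue may be combined out of their original order.

```python
from collections import deque


def ceil_log2(m):
    """Smallest integer e with 2**e >= m, for m >= 1."""
    return (m - 1).bit_length()


def fan_in_queue(operands, k, op):
    """Simulate the circular-queue fan-in with k >= 1 processors.

    Each pass of the loop is one time unit: up to k processors each take
    two operands from the front of the queue and append op(a, b) to the
    back. Returns (result, time_units).
    """
    queue = deque(operands)
    time_units = 0
    while len(queue) > 1:
        busy = min(k, len(queue) // 2)   # processors that receive a full pair
        results = [op(queue.popleft(), queue.popleft()) for _ in range(busy)]
        queue.extend(results)
        time_units += 1
    return queue[0], time_units


def predicted_time(n, k):
    """Time stated in the text for n >= 2 operands and k processors."""
    if k >= n // 2:
        return ceil_log2(n)
    r = n - k * (n // k)
    return n // k - 1 + ceil_log2(k + r)


print(fan_in_queue(range(1, 11), 3, min), predicted_time(10, 3))
```

With n = 10 and K = 3 the queue shrinks from 10 to 7, 4, 2 and 1 operands, which is the count the closed form gives for ⌊n/K⌋ = 3 and r = 1.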

In order to show these time bounds are indeed
equivalent to the lower time bounds shown in Theorem 2.5, we
give the following lemma.

Lemma 2.6: T1 = T2 for all n where

   T1 = ⌈(n - 2^i)/K⌉ + i          if n - 1 ≥ 2^i
      = log n                       otherwise

   T2 = ⌊n/K⌋ - 1 + log(K + r)     if ⌊n/2⌋ > K
      = log n                       otherwise

   i = log K and 0 ≤ r = n - K⌊n/K⌋ < K.

Proof: Since 2^i ≥ K > 2^i/2, if n - 1 < 2^i

   ==> 2K > 2^i > n - 1 ==> K ≥ ⌊n/2⌋.

Therefore there are only three major cases to consider based
on the range differences.

Case 1: n - 1 < 2^i and ⌊n/2⌋ ≤ K

   T1 = T2 = log n

Case 2: n - 1 ≥ 2^i and ⌊n/2⌋ ≤ K

   ==> 2 x 2^i ≥ 2K ≥ 2⌊n/2⌋ ≥ 2^i
   ==> i + 1 = log K + 1 = log n and K ≥ n - 2^i ≥ 1

Hence

   T1 = ⌈(n - 2^i)/K⌉ + i = 1 + log n - 1
      = log n
      = T2

Case 3: n - 1 ≥ 2^i, ⌊n/2⌋ > K and 0 ≤ r = n - K⌊n/K⌋ < K

There are two minor cases to consider based on whether K is
a power of 2. Two subcases arise when considering whether n
is a multiple of K.

Case 3.1: K = 2^i

Case 3.1a: n/K = t = integer

   ==> ⌈n/K⌉ = ⌊n/K⌋ = t and r = 0

Hence

   T1 = ⌈(n - K)/K⌉ + i = t - 1 + i

and

   T2 = ⌊n/K⌋ - 1 + log(K) = t - 1 + i

Case 3.1b: n/K ≠ t

   ==> ⌈n/K⌉ = ⌊n/K⌋ + 1 and K > r > 0
   ==> log 2K = log(K + r) = 1 + log K = 1 + i

Hence

   T1 = ⌈(n - K)/K⌉ + i = ⌈n/K⌉ - 1 + i = ⌊n/K⌋ + i

and

   T2 = ⌊n/K⌋ - 1 + log(K + r) = ⌊n/K⌋ + i

Case 3.2: 2^i > K > 2^i/2 ==> 2^i - K < 2^i/2 < K

Case 3.2a: n/K = t

   ==> ⌈n/K⌉ = ⌊n/K⌋ = t and r = 0

Hence

   T1 = ⌈(n - 2^i)/K⌉ + i
      = ⌈(n - K)/K⌉ + ⌈(K - 2^i)/K⌉ + i = t - 1 + i

and

   T2 = ⌊n/K⌋ - 1 + log(K) = t - 1 + i

Case 3.2b: n/K ≠ t

   ==> ⌈n/K⌉ = (n - r)/K + 1 = ⌊n/K⌋ + 1 and K > r > 0

Since

   2 x 2^i > 2K > K + r ==> i + 1 ≥ log(K + r)

if K + r > 2^i
   ==> K + r - 2^i < K and log(K + r) = i + 1
   then T1 = ⌈(n - K - r)/K⌉ + ⌈(K + r - 2^i)/K⌉ + i = ⌊n/K⌋ + i
        T2 = ⌊n/K⌋ - 1 + log(K + r) = ⌊n/K⌋ + i

if K + r ≤ 2^i
   ==> 2^i - K - r < K and log(K + r) = i
   then T1 = ⌈(n - K - r)/K⌉ + ⌈(K + r - 2^i)/K⌉ + i = ⌊n/K⌋ - 1 + i
        T2 = ⌊n/K⌋ - 1 + log(K + r) = ⌊n/K⌋ - 1 + i

Having shown T1 = T2 in all cases, the proof is completed. □


Hence, in both cases, the exact minimum number of time
units is achieved and thus the time bounds of Algorithm
A(n)1 are tight.

The second, named Algorithm A(n)2, is a generalized
version of Savage's partition method for finding the minimum
of n elements [37 p.17].

Algorithm A(n)2

1. if K ≥ ⌊n/2⌋ then use recursive doubling to compute A(n).

2. Partition the problem, A(n), into K subproblems,
S(1), S(2), ..., S(K), of ⌊n/K⌋ operands each, so that
0 ≤ r = n - K⌊n/K⌋ < K operands remain. For example,
S(1) = a(1) ⊗ a(2) ⊗ ... ⊗ a(⌊n/K⌋). Assign one processor to
each S(i) and then compute all S(i) in parallel.

3. Use recursive doubling to compute A(n) with all S(i) and
the r remaining operands as input.

If K ≥ ⌊n/2⌋, clearly step 1 can be done in log n time
units by using recursive doubling with at most ⌊n/2⌋
processors. If K < ⌊n/2⌋ and 0 ≤ r < K, in step 2, each of
the K processors computes S(i) sequentially in at most
⌊n/K⌋ - 1 time units, whereas step 3 can be done in log(K + r)
time units with at most K - 1 ≥ ⌊(K + r)/2⌋ processors by using
recursive doubling. Hence the total time requirement is
⌊n/K⌋ - 1 + log(K + r). In both cases, the same time bounds as
Algorithm A(n)1 are achieved. Thus we have established the
following theorem.

Theorem 2.7: Given K processors, Algorithm A(n)1 or
A(n)2 computes A(n) in ⌊n/K⌋ - 1 + log(K + r) time units if
⌊n/2⌋ > K and log n time units if ⌊n/2⌋ ≤ K, where
r = n - K⌊n/K⌋. Moreover, the time bounds in both cases are
tight.
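
The partition method can be sketched in the same way. The Python code below is a sequential illustration under stated assumptions: `partition_fan_in` and `recursive_doubling` are hypothetical names, "log" is read as ⌈log2⌉, and time is counted as one unit per parallel round. Because blocks are kept in their original order, the sketch needs only associativity of the operator.

```python
from functools import reduce


def recursive_doubling(values, op):
    """Combine neighbouring pairs until one value is left.
    Returns (result, rounds); each round is one parallel time unit."""
    values = list(values)
    rounds = 0
    while len(values) > 1:
        paired = [op(values[i], values[i + 1])
                  for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            paired.append(values[-1])   # odd element waits for the next round
        values = paired
        rounds += 1
    return values[0], rounds


def partition_fan_in(operands, k, op):
    """Partition-based fan-in with k >= 1 processors; returns (result, time)."""
    operands = list(operands)
    n = len(operands)
    if k >= n // 2:                                  # step 1
        return recursive_doubling(operands, op)
    size = n // k                                    # step 2: k blocks of floor(n/k)
    blocks = [operands[j * size:(j + 1) * size] for j in range(k)]
    leftover = operands[k * size:]                   # the r remaining operands
    partial = [reduce(op, block) for block in blocks]   # size - 1 steps per block
    result, rounds = recursive_doubling(partial + leftover, op)   # step 3
    return result, (size - 1) + rounds


print(partition_fan_in("abcdefghij", 3, lambda a, b: a + b))
```

For n = 65536 and K = 4096 = n/log n the sketch counts 15 sequential steps per block plus 12 doubling rounds, that is 27 = 2 log n - loglog n - 1 time units.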


By substituting K = ⌈n/log n⌉ and ⊗ = Min into Theorem
2.7, we obtain T(P) = 2 log n - loglog n - 1 with
S(P) = (n-1)/(2 log n - loglog n - 1) and E(P) = O(1), which is
basically Savage's result. By interpolating different K's
into Theorem 2.7, one can easily verify that K = ⌈n/log n⌉
is minimal subject to the constraint of O(log n) time units.
Therefore it is the best answer to (A). Also, this result
can be directly applied to reduce the processor requirement
of Goldschlager's algorithm for finding the absolute sink
(i.e., a node that has in-degree n - 1 and out-degree 0) [19] from
n² to n⌈n/log n⌉ with the same time requirement (i.e.,
O(log n)).


2.2.2 Matrix Multiplication

Multiplying two n x n matrices in our model of unbounded
parallel computation by the "high school" method, which can
be regarded as computing n² vector inner products in
parallel, takes O(log n) time units with n³ processors if
simultaneous read is allowed. Combining Algorithm A(n)1 or
A(n)2 with the high school method, the processor bound can
be lowered to n²⌈n/log n⌉. Csanky [13] developed the first
parallel version of Strassen's matrix multiplication
algorithm [39] which runs in O(log n) time units with n^2.81
processors. Recently, a more general parallel version of
Strassen's algorithm, which runs in O(n^2.81/P) time units
with P ≤ n^2.81/log n processors, S(P) = P and E(P) = O(1),
has been constructed by Chandra [10]. A practical advantage
of Chandra's algorithm is that it is free of memory conflicts.

Furman [17] and Munro [31] independently observed that
the transitive closure of an n x n Boolean matrix A can be
accomplished by matrix multiplications with the two
operations x and + of inner product being changed to logical
AND and OR correspondingly. Moreover, Munro [31] presented
an O(n^2.81) sequential algorithm which essentially requires
one matrix multiplication. By squaring A log n times, the
transitive closure of A can be computed in O(log² n) time
units with n³ processors using the high school method [21],
and in O(n^2.81 (log n)/P) time units with P ≤ n^2.81/log n
processors, S(P) = O(P) and E(P) = O(1) using Chandra's
parallel Strassen's algorithm [10]. The speedup and its
efficiency over the best sequential algorithm are
S(P) = O(P/log n) and E(P) = O(1/log n). Employing the high
school method to compute transitive closure, Goldschlager
[19] verified bipartiteness and Arjomandi [2] determined the
connected components in O(log² n) time units with n³
processors. The processor requirements for both problems can
be decreased to n^2.81/log n if Chandra's algorithm is used.
Analogously, the log n (n^3.81) processor requirement of
Savage's O(log² n) algorithm for finding dominators
[37 p.70-71] can be improved to n^3.81/log n by using
Chandra's algorithm.
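
The repeated-squaring idea can be illustrated with a short sequential Python sketch. It uses the plain "high school" product with AND and OR, not Strassen's or Chandra's algorithm, and it assumes the reflexive closure is wanted, so the identity is OR-ed into A before squaring; the function names are hypothetical.

```python
def bool_square(m):
    """Boolean matrix product m * m with AND for x and OR for +."""
    n = len(m)
    return [[any(m[i][k] and m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def reflexive_transitive_closure(adj):
    """Square (A OR I) ceil(log2 n) times; entry [i][j] is then True
    exactly when node j is reachable from node i."""
    n = len(adj)
    reach = [[bool(adj[i][j]) or i == j for j in range(n)] for i in range(n)]
    for _ in range((n - 1).bit_length()):   # ceil(log2 n) squarings
        reach = bool_square(reach)
    return reach


chain = [[0, 1, 0, 0],
         [0, 0, 1, 0],
         [0, 0, 0, 1],
         [0, 0, 0, 0]]
closure = reflexive_transitive_closure(chain)
print(closure[0][3], closure[3][0])
```

After k squarings the matrix records all paths of at most 2^k edges, so ⌈log2 n⌉ squarings cover every simple path.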

If A is a symmetric matrix, the problem of finding the
transitive closure reduces to identifying its connected
components and was efficiently solved by Savage [37 p.50] in
O(log² n) time units with n⌈n/log n⌉ processors. The
processor requirement of her method, indeed, can be further
reduced to n⌈n/log² n⌉. We will discuss the details in
Section 3.1.4.

By replacing x and + with + and Min in performing the
inner product, denoted ⊗, Backhouse and Carre [4] extended
the matrix multiplication method to find the all-pairs
shortest paths of G if no negative cycles are present.
Savage [37 p.104-109] presented an O(log² n) algorithm for
this problem with n²⌈n/log n⌉ processors. She then used it
to develop an O(log² n) algorithm for determining the
shortest cycle in an n-node directed graph using
n²⌈n/log n⌉ processors. However, if Chandra's algorithm is
employed, the processor requirements of the above two
problems can both be lowered to n^2.81/log n.

Furthermore, Lawler [26 p.91] pointed out that
detecting the existence of negative cycles or finding the
length of the shortest cycle becomes a by-product of the
matrix multiplication method. For detecting the existence of
negative cycles in an n-node weighted graph G: after
squaring the weight matrix W of G log n times, if the
minimum of the diagonal elements of the resultant weight
matrix is negative, a negative cycle exists. The parallel
realization, Algorithm NEG.CYCLE, is presented below.

Algorithm NEG.CYCLE

comment: find the transitive closure of the weight matrix
         W by Chandra's algorithm.

1. for i = 1 until log n do W ← Min{W, W⊗W}

2. if Min{W(i,i) | ∀i} < 0 then return 'Negative cycle'
   else return 'No negative cycle'

Step 1 of Algorithm NEG.CYCLE can be computed in
O(log n (n^2.81)/P) time units with P ≤ n^2.81/log n processors
by using Chandra's algorithm, while step 2 can be determined
in log n + 1 time units with at most ⌊n/2⌋ processors.
Hence, the total time and processor bounds are
O(log n (n^2.81)/P) and P ≤ n^2.81/log n respectively.

For  finding  the  length  of  the  shortest  cycle:  after 
squaring  the  adjacency  matrix  log  n  times,  the  smallest 
value  of  the  diagonal  elements  of  the  resultant  adjacency 
matrix  is  the  length  of  the  shortest  cycle.  The  same  time 
and  processor  bounds  as  detecting  negative  cycle  are 
obtained. 
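
Both diagonal tests can be tried out with a sequential Python sketch. It uses the ordinary cubic (Min, +) product in place of Chandra's algorithm, represents missing edges by infinity, and assumes that diagonal entries are infinity (or 0 in the weighted case) unless a self-loop is present; all function names are hypothetical.

```python
INF = float("inf")


def min_plus_square(w):
    """(Min, +) product of w with itself."""
    n = len(w)
    return [[min(w[i][k] + w[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def min_plus_closure(w):
    """Apply W <- Min{W, W (x) W} ceil(log2 n) times."""
    n = len(w)
    for _ in range((n - 1).bit_length()):
        sq = min_plus_square(w)
        w = [[min(w[i][j], sq[i][j]) for j in range(n)] for i in range(n)]
    return w


def has_negative_cycle(w):
    """True when the smallest diagonal entry of the closure is negative."""
    closed = min_plus_closure(w)
    return min(closed[i][i] for i in range(len(w))) < 0


def shortest_cycle_length(adj):
    """Number of edges on a shortest directed cycle, or INF if there is none."""
    n = len(adj)
    unit = [[1 if adj[i][j] else INF for j in range(n)] for i in range(n)]
    closed = min_plus_closure(unit)
    return min(closed[i][i] for i in range(n))


weights = [[INF, 1, INF],
           [INF, INF, -3],
           [1, INF, INF]]      # the cycle 0 -> 1 -> 2 -> 0 has weight -1
print(has_negative_cycle(weights))
```

After k iterations an entry holds the minimum weight over walks of 1 to 2^k edges, so ⌈log2 n⌉ iterations are enough to see every simple cycle on the diagonal.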


CHAPTER 3


SOME GRAPH CONNECTIVITY PROBLEMS

In this chapter, we consider problems dealing with the
connectivities in graphs. It has been shown in Section 2.1
and Table 1 that the lower and upper time bounds on
verifying different connectivities in graphs are O(log n)
and O(log² n) respectively. If a graph is disconnected, it is
desirable to identify all the corresponding connected
components. By definition, a graph G is connected if and
only if there exists exactly one connected component in G.
Whence, verifying connectivity in G becomes a consequence of
finding the corresponding connected components. Besides
Algorithm UNI.CON and Algorithm ACYCLIC, which are merely
for verifying unilateral connectivity and acyclicness,
algorithms presented in this chapter are mainly focused on
the problems of finding different connected components.
Problems in undirected and directed graphs are discussed
separately in Sections 3.1 and 3.2.

3.1 Finding Connected Components of Undirected Graphs

Based on the high school method to compute the reflexive
transitive closure of an n x n matrix in O(log² n) time units
on n³ processors, Arjomandi [2] solved the problem of
finding the connected components of an n-node undirected
graph in O(log² n) time units on n³ processors, which can be
lowered to n^2.81/log n processors by using Chandra's
parallel Strassen's algorithm. Later, Hirschberg [21]
reduced the processor requirement to n² while maintaining
the O(log² n) time units. Recently, Savage [37 p.49] has
improved the processor bound of Hirschberg's method to
n⌈n/log n⌉. We are going to show that this processor bound,
in fact, can be further improved to n⌈n/log² n⌉ with
linear speedup and efficiency O(1). Hirschberg's algorithm
is first described in Section 3.1.1 and the corrections of
his algorithm are discussed in Section 3.1.2. The improved
algorithm is presented in Section 3.1.3 and its applications
are mentioned in Section 3.1.4.


3.1.1 General Hirschberg's Algorithm

In general, Hirschberg's Algorithm CONNECT (see Figure
1) for finding connected components of an undirected graph G
can be interpreted as follows:

Step 1. (Uniform Smallest Incident Node Selection): For each
        node i, select the "minimum" edge (i,j) where node j
        has the smallest node label among all the nodes
        incident upon node i. The group of nodes that are
        connected by the selected edges defines a forest.
        (Isolated nodes, not incident to any edges selected
        so far, are regarded as trees with one node.)

Step 2. (Path Compression): Identify the trees, i.e.,
        contract each tree to a single node called center
        which is actually the smallest-labelled node in the
        tree. Repeat steps 1 and 2 with the new graph
        defined by the centers as nodes until there is no
        edge in the new graph.

Step 3. (Clean Up): (This optional step may be executed
        after any step 2. It simplifies future computations
        by deleting superfluous edges.) Delete all
        unselected edges with both endpoints in the same
        center. For each pair of centers connected by more
        than one unselected edge, delete all but the
        "minimum" of such edges connecting the pair of
        centers.

ALGORITHM CONNECT

begin comment: n x n array A is the input adjacency matrix.
               Connected components are labelled by the smallest
               node label in themselves. Vector D of size n
               stores the output s.t. D(x) indicates to which
               connected component node x belongs.

1.  ∀x do
      begin D(x) ← x
            C(x) ← Min{y≠x | A(x,y)=1}
            Flag(x) ← 1
      end

    for i = 1 until log n do

2.    begin ∀x s.t. Flag(x)=0 do
              begin D(x) ← Min{D(x), D[C(x)]}
                    for j = 1 until log n do
                      begin C(x) ← C[C(x)]
                            D(x) ← D[C(x)]
                      end
              end

3.          ∀x s.t. Flag(x)=0 do D(x) ← D[D(x)]

4.          ∀x do
              begin A[x,D(x)] ← 1
                    A[D(x),x] ← 1
              end

5.          ∀x do if D(x)≠x then Flag(x) ← 0

6.          ∀x s.t. Flag(x)=0 do
              C(x) ← Min{D(y)≠D(x) | A(x,y)=1} if none then D(x)

7.          ∀x s.t. Flag(x)=1 do
              C(x) ← Min{C(y)≠D(x) | A(x,y)=1} if none then D(x)
      end
end

Figure 1: Hirschberg's Algorithm CONNECT for Finding the
          Connected Components
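
The general scheme of Steps 1 to 3 can be sketched sequentially in Python. This is an illustration of the interpretation given above and not a transcription of Algorithm CONNECT: the name `connected_components` is hypothetical, the adjacency matrix is assumed symmetric, and each "for all" of the parallel model is simulated by building new values from the old ones.

```python
def connected_components(adj):
    """Label every node with the smallest node of its connected component."""
    n = len(adj)
    label = list(range(n))                 # current center of every node
    edges = [[bool(adj[x][y]) and x != y for y in range(n)] for x in range(n)]
    centers = set(range(n))
    while True:
        # Step 1: every center selects its smallest-labelled adjacent center.
        parent = {x: min((y for y in centers if edges[x][y]), default=x)
                  for x in centers}
        if all(parent[x] == x for x in centers):
            break                          # no edge left between centers
        # The smallest node of each tree-loop lies on a loop of length two;
        # make it the root by pointing it at itself.
        for x in centers:
            if parent[parent[x]] == x and x < parent[x]:
                parent[x] = x
        # Step 2: path compression by pointer jumping, ceil(log2 n) rounds.
        for _ in range((n - 1).bit_length()):
            parent = {x: parent[parent[x]] for x in centers}
        label = [parent[c] for c in label]
        # Step 3: contract each tree to its center; edges inside a tree and
        # duplicate edges between two centers disappear.
        merged = [[False] * n for _ in range(n)]
        for x in centers:
            for y in centers:
                if edges[x][y] and parent[x] != parent[y]:
                    merged[parent[x]][parent[y]] = True
        edges = merged
        centers = {parent[x] for x in centers}
    return label


graph = [[0, 1, 0, 0, 0, 0],
         [1, 0, 1, 0, 0, 0],
         [0, 1, 0, 0, 0, 0],
         [0, 0, 0, 0, 1, 0],
         [0, 0, 0, 1, 0, 0],
         [0, 0, 0, 0, 0, 0]]
print(connected_components(graph))
```

Every tree formed in Step 1 that is not an isolated node contains at least two centers, so the number of non-isolated centers at least halves per round and the outer loop runs at most about log n times.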

A tree-loop is a terminally rooted tree which has an
additional edge going from the root to one of its ancestors.
Thus, in a tree-loop, the number of nodes is equal to the
number of directed edges.

In step 1, the Uniform Smallest Incident Node Selection
basically defines a spanning forest of at most ⌊n/2⌋
tree-loops each having at least two nodes. Each isolated
node, not incident to any node so far, is a connected
component. It will remain isolated in the subsequent
processes and may be considered inactive, i.e., no processor
is required to perform any further operation for the
isolated node. The trick in step 1 is the smallest-labelled
incident node selecting rule - a symmetric relation - which
assures that the smallest-labelled node (center) of each
tree is in the loop and the immediate descendant of the
center is also the immediate ancestor of the center. Whence
the loop is assured to be of length two. Path compression in
step 2 is the crux of this algorithm. At each iteration, the
immediate descendants of all the nodes are connected to
their immediate ancestors. Hence, the minimum length path
i → j is shorter by a factor of two. In other words, Path
Compression can also be viewed as relabelling nodes on the
common path with the center label. This technique is used to
"bring together" nodes on a common path to the center. Since
the longest possible path is of length n-1, at most
log(n-1) iterations are required. A new graph is now
established with at most ⌊n/2⌋ non-isolated nodes (centers).
The process will be repeated until all the centers are
determined to be either connected or isolated. Hence, the
algorithm correctly finds the connected components in any
undirected graph. The foregoing reasoning is summarized in
the following sequence of lemmas where details and proofs
can be found in [21]:

Lemma 3.1: The smallest-labelled node (center) in a
tree-loop, defined by the Uniform Smallest Incident Node
Selection, will be in the loop, and the loop will be of size
two.

Lemma 3.2: All the nodes in the tree-loop can be
relabelled by the center label in at most log(n-1)
iterations.

Lemma 3.3: After each iteration of steps 1 and 2 in the
general Hirschberg's algorithm, the number of centers within
each connected component will decrease by at least half in
those components that have more than one center.

On the unbounded parallel computation model, finding
the smallest incident node by recursive doubling and path
compression can be computed in O(log n) time with n⌊n/2⌋ and
n processors respectively. Since the size of the graph
(actually the number of non-isolated nodes) is reduced by at
least half after each iteration, the process will be
repeated at most log n times. Thus the algorithm requires at
most O(log² n) time units with n⌊n/2⌋ processors.
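
The log(n-1) bound on path compression can be checked on the worst case, a single path of n nodes. The Python sketch below is illustrative only; `pointer_jump` is a hypothetical name, "log" is read as ⌈log2⌉, and each list comprehension stands for one parallel step in which every node reads the old pointer values.

```python
def pointer_jump(parent):
    """Replace every pointer by the pointer of its target, repeated
    ceil(log2(n - 1)) times. parent[x] is the node x points to and the
    root points to itself. Returns (compressed pointers, rounds used)."""
    n = len(parent)
    rounds = (n - 2).bit_length() if n >= 2 else 0   # ceil(log2(n - 1))
    for _ in range(rounds):
        parent = [parent[parent[x]] for x in range(n)]
    return parent, rounds


# Worst case: the path 8 -> 7 -> ... -> 1 -> 0 of length n - 1 = 8.
chain = [0] + list(range(8))
print(pointer_jump(chain))
```

Each round halves the distance of every node to the root, so a path of length 8 is flattened after 3 rounds.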


3.1.2 Corrections of Hirschberg's Algorithm

In [21], Hirschberg presented a parallel algorithm for
finding the connected components (see Figure 1, Algorithm
CONNECT). Three vectors of size n are used to implement Path
Compression: vector Flag indicates the current nodes in the
new graph; vector D stores the current label of the nodes;
and vector C points to the immediate descendant of each
node. Any attempt to follow the algorithm as described may
produce incorrect results. The purpose of this section is to
point out the mistakes in Hirschberg's algorithm and to
offer an alternative which will always produce correct
intermediate and final results.

To demonstrate the problem in Hirschberg's Algorithm
CONNECT, we consider a simple illustration involving the
undirected graph G depicted in Figure 2(a). Although the
intermediate results from executing lines 4 and 5 of step 2
in Algorithm CONNECT differ depending on whether these lines
are executed in series† or in parallel, the correctness of
the whole algorithm is not affected. Therefore we can assume
that all instructions at each step will be executed in
parallel as long as no write conflict exists††.


† If Algorithm CONNECT is executed line by line, at the end
of the first iteration, the same contradiction - a loop of
length 3 - is also found. At the end of the second
iteration, node 3 and node 5 are relabelled to 2 instead of
the correct value 1; they are, however, relabelled to 1 in
the last iteration.
†† Step 4 must be executed in series, otherwise a write
conflict may occur.

[Figure 2: Counter Example for Algorithm CONNECT. (a) The
graph G. (b) Forest of tree-loops obtained after step 1.
(c) Tree-loop of size 3 obtained after step 2. (d) Correct
tree-loop after step 2. The accompanying trace lists the
adjacency matrix A of graph G and the vectors after Step 1
and after each step of Iterations 1 to 3; in Iteration 3,
steps 2 and 3 cause no change and steps 4-7 do not affect
the result in the output vector D.]



At the end of the first iteration, the selected
incident node for each center in the new graph G' is stored
in vector C which defines a tree-loop as shown in Figure
2(c). Surprisingly, a loop of length three is found, which
contradicts Lemma 3.1. Consequently, at the end of the
second iteration, node 1 and node 6 are relabelled to 2
instead of the correct value 1, while at the end of the last
iteration, two connected components are reported in the
output vector D whereas the correct result is one connected
component. The correct tree-loop of G' is depicted in Figure
2(d). The essential problem in the algorithm lies in step 7
wherein C(x) is incorrectly found due to the
misinterpretation of the unselected edges of the centers. To
correct this problem, the following modification of steps 6
and 7 is suggested (i.e., delete 'Flag(x)=0' from step 6 so
that the smallest unselected node directly incident upon
each center is included; and replace 'A(x,y)=1' in step 7 by
'D(y)=D(x)').

6'. ∀x do
      C(x) ← Min{D(y)≠D(x) | A(x,y)=1} if none then D(x)

7'. ∀x s.t. Flag(x)=1 do
      C(x) ← Min{C(y)≠D(x) | D(y)=D(x)} if none then D(x)

In step 7', nodes belonging to the current centers are
identified by checking their current node label with the
center label, and consequently step 4 can be eliminated.

A minor error is also found in step 1 when finding the
smallest-labelled incident node for each node which is
initially isolated. To be consistent with the rest of the
algorithm in handling an isolated node, the value of node x
is assigned to C(x). Therefore line 3 in step 1 is modified
as follows:

      C(x) ← Min{y≠x | A(x,y)=1} if none then D(x)

3.1.3 Modified Hirschberg's Algorithm

In Hirschberg's Algorithm CONNECT, the n² processor
requirement is contributed solely by finding the smallest
incident node (steps 1, 6 and 7). Therefore reducing the
processor requirement in these steps may substantially
reduce the overall processor requirement. Finding the
smallest incident node is equivalent to finding the minimum
of n elements which can be done in 2 log n - loglog n - 1
time units with ⌈n/log n⌉ processors by Theorem 2.7. Based
on this result, Savage [37 p.49] reduced the processor
requirement of Hirschberg's algorithm to n⌈n/log n⌉ with
efficiency O(1/log n).

A question now arises naturally: can the efficiency be
improved to O(1)? That is, can the processor requirement be
reduced by another factor of log n? The answer to this
question turns out to be a favourable one when the optional
Clean Up step, stated in general Hirschberg's algorithm, is
executed to guarantee the reduction of non-isolated nodes by
at least a factor of two after each iteration.

To achieve the desired processor reduction in finding
the smallest incident node, the strategy of optimization is
applied in two different manners: for the first iteration
which has the largest problem size n, minimize the processor
requirement P(n) subject to O(log² n) time; for the
subsequent iterations, minimize the time subject to P(n)
processors. Since the "Min" operation is binary associative,
by Theorem 2.7, the smallest incident node can be determined
in less than 2 log n time units with ⌈n/log n⌉ processors. It
is obvious that no more than 2 log² n time units are required
by using ⌈n/log² n⌉ processors. Hence, for n nodes,
P(n) = n⌈n/log² n⌉ processors are required. From the second
iteration onwards, the worst case time requirement for each
subsequent iteration can also be calculated by Theorem 2.7.
As desired, the worst case total time for all log n
iterations remains O(log² n). A more detailed analysis will be
discussed after the modified Hirschberg's algorithm is
presented.

Based on the above stated strategy of optimization,
Algorithm CONNECT is modified to Algorithm MOD.CONNECT as
shown in Figure 3. With respect to the corrected Algorithm
CONNECT in Section 3.1.2, the modifications are discussed as
follows:

1. Observe that in Algorithm CONNECT, the number of times to
   find the smallest incident node is log n + 1 which is one
   more than the theoretical worst case. Indeed, the last
   one is virtual work which can be eliminated by slightly
   rearranging the order of instructions, i.e., move steps
   6' and 7' in corrected Algorithm CONNECT up to the
   beginning of the first for-loop (before step 2) and
   delete the third line of step 1. (The corresponding steps
   in Algorithm MOD.CONNECT are steps 2 and 3.)

2. The adjacency matrix is updated by performing logical OR
   between the selected columns (step 8). Since vector Flag
   has been designed to indicate the current centers, its
   values can be used as addresses of columns in the
   subsequent assignment so that only those columns where
   the appropriate value of Flag is 1 will be fetched. Thus,
   for finding the smallest incident node, checking
   condition 'Flag(y)=1' in step 2 of Algorithm
   MOD.CONNECT will assure the desired problem size
   reduction (i.e., reduction of number of columns) and
   consequently achieve the desired time reduction.

3. If the whole connected component has merged to a single
   center (i.e., the center will be isolated), that center
   will not be considered in the succeeding iterations by
   having its flag set to zero at step 4.

Algorithm MOD.CONNECT

begin comment: Flag(x)=1 indicates node x is a current
               center and column x stores the updated
               information.

1.  ∀x do comment: Initialization
      begin D(x) ← x
            Flag(x) ← 1
      end

    for i = 1 until log n do
      begin comment: Uniform Smallest Incident Node Selection
2.      ∀x,y s.t. Flag(y)=1 do
          C(x) ← Min{D(y)≠D(x) | A(x,y)=1}
                 if none then D(x)
3.      ∀x s.t. Flag(x)=1 do
          begin C(x) ← Min{C(y)≠D(x) | D(y)=D(x)}
                       if none then D(x)
4.              if C(x)=D(x) then Flag(x) ← 0
                comment: Path Compression
5.              D(x) ← Min{D(x), D[C(x)]}
6.              for j = 1 until log(n-1) do
                  begin C(x) ← C[C(x)]
                        D(x) ← D[C(x)]
                  end
          end
7.      ∀x s.t. Flag(x)=0 do D(x) ← D[D(x)]
        comment: Clean Up (by column contraction)
8.      ∀x,y s.t. y=D(y) do
          ∀z s.t. Flag(z)=1 do
            A(x,y) ← OR{A(x,z) | D(z)=D(y)}
9.      ∀x do if D(x)≠x then Flag(x) ← 0
      end
end

Figure 3: Modified Hirschberg's Algorithm

In order to prove that these modifications to Algorithm
CONNECT produce the desired processor reduction, a few
definitions and lemmas are needed.

Let Mi be the smallest element among the k selected
elements, {M1, M2, ..., Mk}, from a vector V of size n. To
"merge" the k selected elements means

   V(Mi) ← OP{V(Mj)≠0 | 1 ≤ j ≤ k},

where OP can be Min or logical OR, which depends on using the
index or the content of V.

Given a vector of n elements, if at least 2 and at most
n elements are "merged" to form a new element, then the size
of the vector is reduced to at least one and at most ⌊n/2⌋
accordingly. Such a "merge" is called shrink.

Lemma 3.4: Given K processors, performing a shrink
operation on a vector V of size n requires at most
$\lceil n/K \rceil - 1 + \log K$ time units if $\lfloor n/2 \rfloor > K$
and $\log n$ time units if $\lfloor n/2 \rfloor \le K$.

Proof: Performing a shrink operation on V can be
thought of as partitioning V into m groups,
{g(1), g(2), ..., g(m)}, and performing "merge" simultaneously
on all m groups. Clearly, the parallel computation to
"merge" n elements is a binary computation tree which
requires n-1 operations (internal nodes) in total. Let T be
the time to shrink V and K be the number of processors
available.

Case 1: No partition, i.e., all n elements are "merged" to
form a single new element. This can be solved optimally by
using Algorithm A(n)1 or A(n)2 in Section 2.2.1. Hence, the
time bound follows directly from Theorem 2.7, i.e.

$$T = \begin{cases} \ldots & \text{if } \lfloor n/2 \rfloor > K \\ \ldots & \text{if } \lfloor n/2 \rfloor \le K \end{cases} \qquad (1)$$

where $0 \le r = n - K\lfloor n/K \rfloor < K$.

Case 2: V is partitioned into m groups. It can be considered
as splitting an n-leaf binary computation tree into m
smaller binary computation subtrees. Because of the
splitting, the independence between subtrees increases the
parallelism while the total number of internal nodes is
decreased. The total number of operations is the sum of the
internal nodes of all m binary computation subtrees, i.e.,

$$(|g(1)| - 1) + (|g(2)| - 1) + \cdots + (|g(m)| - 1) = n - m.$$

Case 2a: $K \ge \lfloor n/2 \rfloor$. Assign $\lfloor |g(i)|/2 \rfloor$ processors to group
g(i) and then execute all groups in parallel, which takes
$\max\{\log |g(i)| \text{ for } 1 \le i \le m\} \le \log n$ time units. The total
number of processors required is

$$\sum_{i=1}^{m} \lfloor |g(i)|/2 \rfloor \le \lfloor n/2 \rfloor.$$

Case 2b: $K < \lfloor n/2 \rfloor$. Align the m groups of elements as shown
in Figure 4 and partition the elements into K sets, each of
$\lceil n/K \rceil$ elements, except that the last set has
$n - (K-1)\lceil n/K \rceil$ elements.


[Figure 4: the n elements of V drawn in a row. The groups
g(1), g(2), g(3), ..., g(m) are marked above the row; the K sets
are marked below it, each spanning k elements except the last,
which spans r elements.]

where k = $\lceil n/K \rceil$ elements
      r = $n - (K-1)\lceil n/K \rceil$ elements


Figure 4: Partition of m Groups into K Sets


Assign one processor to each of the K sets to compute
the results in that set. If all the elements in a set belong
to the same group, one answer will result from that set. If
the elements in a set belong to several groups, say b
groups, then b answers, one for each group, will be
obtained. If b > 2, at least b - 2 answers are final results
and at most 2 answers in each set will be combined with
answers in other sets to give the final result (the first
and the last set have one such answer). For example (see
Figure 4), sets 1 and 2 have one answer, set 3 has 2 answers
and set 4 has 3 answers, one of which is a final result.
It is obvious that no more than $\lceil n/K \rceil - 1$ time units are
needed for computing the answers in each set.

Let us assume that n(i) answers will be combined to
give the result of g(i). (In Figure 4, n(1) = 3, n(2) = 2
and n(3) = 0.) Assign $\lfloor n(i)/2 \rfloor$ processors to each group to
compute the result of that group. Since the sum of all n(i)
is less than or equal to 2K - 2, the total number of
processors required will be less than K. Each g(i) will take
another $\log n(i) \le \log K$ time units to obtain the final
result. Thus, the total time requirement is no more than
$\lceil n/K \rceil - 1 + \log K$ time units. It is clear that this time
requirement is at most one time unit from optimal (i.e., the
corresponding case in (1)). □
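The two-phase scheme of Case 2b can be illustrated with a small Python sketch. It is a sequential simulation only, not a parallel implementation: the function name `shrink_case_2b`, the `group_of` list (group index of each element, with groups assumed contiguous as in Figure 4) and the 0-based indexing are assumptions introduced here for illustration.

```python
from math import ceil

def shrink_case_2b(values, group_of, K, op=min):
    """Sequentially simulate Case 2b of Lemma 3.4.

    Phase 1: each of the K sets (one per processor) folds its own
    elements group by group. Phase 2: partial answers that belong to
    the same group are combined. Returns the per-group results and the
    number of partial answers that had to be combined (the sum of n(i)).
    """
    n = len(values)
    k = ceil(n / K)                       # set size; the last set may be shorter
    partial = {}                          # group -> partial answers, one per set
    for s in range(K):
        local = {}
        for idx in range(s * k, min((s + 1) * k, n)):
            g = group_of[idx]
            local[g] = values[idx] if g not in local else op(local[g], values[idx])
        for g, v in local.items():
            partial.setdefault(g, []).append(v)
    result = {}
    for g, answers in partial.items():
        acc = answers[0]
        for v in answers[1:]:             # a tree of depth log n(i) when run in parallel
            acc = op(acc, v)
        result[g] = acc
    combined = sum(len(a) for a in partial.values() if len(a) > 1)
    return result, combined

values   = [5, 3, 8, 1, 9, 2, 7, 4, 6, 0]
group_of = [0, 0, 0, 0, 0, 1, 1, 2, 2, 2]
result, combined = shrink_case_2b(values, group_of, K=4)
```

With K = 4 and n = 10 the sets hold 3, 3, 3 and 1 elements, and the number of partial answers to combine stays within the 2K - 2 limit used in the proof.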

Theorem 2.7 and Lemma 3.4 give an upper bound on the
parallel time complexity for computing a result of n
elements and results of groups of n elements. Since the
'Min' operation in steps 2 and 3 and the 'OR' operation in
step 8 of Algorithm MOD.CONNECT are associative binary
operations, Theorem 2.7 and Lemma 3.4 give an upper bound on
the total number of time units spent in these steps.

Lemma 3.5: Given K processors, step 2 in Algorithm
MOD.CONNECT takes at most T time units where

$$T = \begin{cases} O(\log^2 n) & \text{if } K \ge n\lfloor n/2 \rfloor \\ O(n^2/K + \log n \log K) & \text{if } n \le K < n\lfloor n/2 \rfloor \\ O(n^2/K) & \text{if } 0 < K < n. \end{cases}$$

Proof: As mentioned earlier (Lemmas 3.1-3.3), further
iterations of steps 2-8 merge centers. Step 9 eliminates
those merged nodes which are no longer centers by setting
their flags to zero. By Lemma 3.3, the number of centers
(flagged elements), |S|, in each connected component
decreases by a factor of at least two after each iteration
until the connected component is represented by a single
center. Moreover, if the whole connected component has
merged to a single center (i.e., the center will be
isolated), that center will not be considered in the
succeeding iterations by having its flag set to zero at step
4. Thus, we have n flagged elements at the first iteration
and at most $\lfloor n/2^i \rfloor$ flagged elements after i iterations.
At step 2, in order to compute all C(x), $p = \lfloor K/n \rfloor$
processors are assigned to each x to compute the minimum
value among at most |S| elements. Since 'Min' is an
associative binary operation, we can apply Theorem 2.7 to
evaluate the time complexity.

Case 1: $K \ge n\lfloor n/2 \rfloor$. We have

$$\sum_{i=0}^{\log n - 1} \log(n/2^i) = O(\log^2 n).$$
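For n a power of two, the sum in Case 1 can be evaluated in closed form. The short derivation below (writing L = log n, a symbol introduced here only for brevity) shows where the quadratic-logarithmic bound comes from.

```latex
\sum_{i=0}^{\log n - 1} \log\frac{n}{2^{i}}
  = \sum_{i=0}^{L-1} (L - i)
  = \frac{L(L+1)}{2}
  = O(\log^{2} n), \qquad L = \log n .
```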

Case 2: $n \le K < n\lfloor n/2 \rfloor$ implies that $1 \le p = \lfloor K/n \rfloor < \lfloor n/2 \rfloor$
processors can be assigned to compute each C(x). Since |S|
is reduced by at least half after each iteration, |S| is at
most p after $t = \log n - \lfloor \log p \rfloor$ iterations. Thus we have

$$\sum_{i=0}^{t-1} \left( \left\lceil \lfloor n/2^i \rfloor / p \right\rceil - 1 + \log p \right) + \sum_{i=t}^{\log n - 1} \log(n/2^i)$$

$$\le \lceil 2n/\lfloor K/n \rfloor \rceil + t \log \lfloor K/n \rfloor + (\log \lfloor K/n \rfloor)^2$$

$$\le O(n^2/K + \log n \log K).$$

Case 3: $0 < K < n$ implies that only K of the C(x) can be computed
in parallel, with one processor each. This takes |S| - 1 time
units. For n C(x), the same computation is repeated no more
than $\lceil n/K \rceil$ times. Thus, we have

$$\sum_{i=0}^{\log n - 1} (\lfloor n/2^i \rfloor - 1) \lceil n/K \rceil \le 2n \lceil n/K \rceil \le O(n^2/K). \qquad \Box$$
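The Case 3 inequality can be checked numerically for small inputs. The sketch below assumes n is a power of two and uses a hypothetical helper name, `step2_case3_time`; it evaluates the left-hand sum exactly and compares it with the bound.

```python
from math import ceil, log2

def step2_case3_time(n, K):
    """Exact value of sum_{i=0}^{log n - 1} (floor(n/2^i) - 1) * ceil(n/K)."""
    return sum((n // 2**i - 1) * ceil(n / K) for i in range(int(log2(n))))

n, K = 16, 4
total = step2_case3_time(n, K)    # 60 + 28 + 12 + 4
bound = 2 * n * ceil(n / K)
```

For n = 16 and K = 4 the sum is 104, below the bound of 128.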


Lemma 3.6: Given K processors, step 3 in Algorithm
MOD.CONNECT takes at most T time units where

$$T = \begin{cases} O(\log^2 n) & \text{if } K \ge n\lfloor n/2 \rfloor \\ O(n^2/K + \log^2 n) & \text{if } n \le K < n\lfloor n/2 \rfloor \\ O(n^2/K) & \text{if } 0 < K < n. \end{cases}$$


Proof: Since the number of flagged elements is reduced
by at least half after each iteration, the number of
processors assigned to compute C(x) can be doubled after each
iteration. For the first iteration, $p = \lfloor K/n \rfloor$ processors are
assigned to compute each C(x). After i iterations, $2^i p$
processors can be assigned for each C(x). Thus, we have

Case 1: $K \ge n\lfloor n/2 \rfloor$ implies that $p = \lfloor K/n \rfloor \ge \lfloor n/2 \rfloor$ processors can
be assigned to compute each C(x).

$$\sum_{i=0}^{\log n - 1} \log n = O(\log^2 n).$$

Case 2: $n \le K < n\lfloor n/2 \rfloor$ implies that $1 \le p = \lfloor K/n \rfloor < \lfloor n/2 \rfloor$
processors are assigned to compute each C(x) for the first
iteration. After $t = \log n - \log p + 1$ iterations,
$2^t p \ge \lfloor n/2 \rfloor$ processors can be assigned to each C(x). Thus,
we have

$$\sum_{i=0}^{t-1} \left( \lceil n/(2^i p) \rceil - 1 + \log(2^i p) \right) + \sum_{i=t}^{\log n - 1} \log n$$

$$\le \lceil 2n/\lfloor K/n \rfloor \rceil + \log n \log n + \log^2 n$$

$$\le O(n^2/K + \log^2 n).$$


Case 3: $0 < K < n$ implies that only K of the C(x) can be computed in
parallel, with one processor for each C(x), which takes
$n/2^i - 1$ time units for iteration i. The same computation
must be repeated $\lceil (n/2^i)/K \rceil$ times. Thus, we have

$$\sum_{i=0}^{\log n - 1} (n/2^i - 1) \lceil (n/2^i)/K \rceil \le 2n \lceil 2n/K \rceil \le O(n^2/K). \qquad \Box$$


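Lemma 3.7: Given K processors, step 8 in Algorithm
MOD.CONNECT takes at most T time units where

$$T = \begin{cases} O(\log^2 n) & \text{if } K \ge n\lfloor n/2 \rfloor \\ O(n^2/K + \log n \log K) & \text{if } n \le K < n\lfloor n/2 \rfloor \\ O(n^2/K) & \text{if } 0 < K < n. \end{cases}$$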


Proof: After sets of centers are merged in steps 5 and
6, the adjacency information among the centers is updated in
step 8, i.e. center x and center y are connected by setting
A(x,y) to 1 if there exists an edge joining one of the
nodes merged into x to one of the nodes merged into y. In
step 8, those columns z in the adjacency matrix A
corresponding to those nodes z which are merged to center
y are 'OR'ed together to give the new column y. Since 'OR' is
an associative binary operation and groups of columns are
'OR'ed together into single columns which correspond to the
newly formed centers, Lemma 3.4 can be applied to derive the
time bound for step 8. There are n rows in A and these rows
of elements are considered in parallel. As in step 2,
$p = \lfloor K/n \rfloor$ processors are assigned to each row x to
compute A(x,y). During the first iteration, p processors are
assigned to each row to deal with n elements; and for each
succeeding iteration, the number of elements to be dealt with
by the p processors is at most half of the number in the
previous iteration. Thus, applying the same kind of analysis
as in the proof of Lemma 3.5, we derive the same time bound
as stated in Lemma 3.5. □
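The column contraction of step 8 can be sketched in Python as follows. This is a sequential illustration under assumptions made here: nodes are numbered from 0, the function name `contract_columns` is hypothetical, and the loops that the thesis executes in parallel are simply run one after another.

```python
def contract_columns(A, D, flag):
    """Step 8: for every center y (y == D[y]) and every row x, set
    A(x,y) to the OR of A(x,z) over flagged z whose center D[z] is y.
    Columns that do not belong to a center are left unchanged."""
    n = len(A)
    B = [row[:] for row in A]                 # read from A, write into B
    for y in range(n):
        if D[y] != y:
            continue
        for x in range(n):
            B[x][y] = int(any(A[x][z] for z in range(n)
                              if flag[z] == 1 and D[z] == y))
    return B

# Path 0-1-2-3 in which nodes {0,1} merged into center 0 and {2,3} into center 2.
A = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
D = [0, 0, 2, 2]
flag = [1, 1, 1, 1]
B = contract_columns(A, D, flag)
```

After the contraction, column 0 of B summarizes adjacency to the group {0, 1} and column 2 adjacency to the group {2, 3}; for instance B[1][2] is 1 because node 1 is adjacent to node 2.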

Theorem 3.8: Algorithm MOD.CONNECT finds the connected
components of an undirected graph with n nodes in time
$O(n^2/K + \log^2 n)$ using K processors.

Proof: The time and processor requirements are listed
in Table 2. From Table 2, K processors suffice to determine
the connected components of an undirected graph with n nodes
in time $O(n^2/K + \log^2 n)$. □

                     Total Time                                  Processors
Step  case 1         case 2                   case 3        case 1  case 2  case 3

 1    O(n/K)         O(1)                     O(1)          K       n       n
 2    O(n²/K)        O(n²/K + log n log K)    O(log² n)     K       nK      n²
 3    O(n²/K)        O(n²/K + log² n)         O(log² n)     K       nK      n²
 4    O(n/K)         O(log n)                 O(log n)      K       n       n
 5    O(n/K)         O(log n)                 O(log n)      K       n       n
 6    O(n log n/K)   O(log² n)                O(log² n)     K       n       n
 7    O(n/K)         O(log n)                 O(log n)      K       n       n
 8    O(n²/K)        O(n²/K + log n log K)    O(log² n)     K       nK      n²
 9    O(n/K)         O(log n)                 O(log n)      K       n       n

where case 1: 0 < K < n
      case 2: n ≤ K < n⌊n/2⌋
      case 3: K ≥ n⌊n/2⌋


Table 2: Total Time and Processor Requirements for Algorithm
MOD.CONNECT


As a by-product of Theorem 3.8, we have the following
result.

Corollary 3.9: Given $n\lceil n/\log^2 n \rceil$ processors, Algorithm
MOD.CONNECT determines the connected components of an
undirected graph with n nodes in time $O(\log^2 n)$.

From Table 2, Algorithm MOD.CONNECT takes $T(1) = O(n^2)$
and $T(p) = O(\log^2 n)$ time units with 1 and $p = n\lceil n/\log^2 n \rceil$
processors respectively. Hence, the speedup and the
efficiency of Algorithm MOD.CONNECT are $S(p) = O(n^2/\log^2 n)$
and $E(p) = O(1)$. This is the best result that uses the least
number of processors to find the connected components of an
undirected graph in time $O(\log^2 n)$. The previous result
[37 p.45] needs $n\lceil n/\log n \rceil$ processors to achieve the same
time bound.
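To make the speedup and efficiency figures concrete, the following sketch evaluates them for a sample n. It is illustrative only: all constant factors hidden by the O-notation are taken to be 1, logarithms are base 2, and the function name `speedup_efficiency` is an assumption introduced here.

```python
from math import ceil, log2

def speedup_efficiency(n):
    """Return (p, S, E) with T(1) = n^2, T(p) = log^2 n, p = n*ceil(n/log^2 n),
    S = T(1)/T(p) and E = S/p, ignoring constant factors."""
    log_sq = log2(n) ** 2
    p = n * ceil(n / log_sq)
    speedup = (n * n) / log_sq
    return p, speedup, speedup / p

p, S, E = speedup_efficiency(1024)
```

For n = 1024 this gives p = 11264 processors and an efficiency of about 0.93, in line with E(p) = O(1).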

It is of interest to compare the time and processor
complexities of Algorithm MOD.CONNECT with the time
complexity of the corresponding efficient sequential
algorithm. Consider a graph G = (V,E) where |V| = n and
|E| = m. If G is dense, i.e. $m = O(n^2)$, the best sequential
algorithm for this problem requires $T(1) = O(m+n) = O(n^2)$
time [43]. This is because any algorithm will, in the worst
case, have to look at all of the edges. Therefore the
speedup and the efficiency of Algorithm MOD.CONNECT over the
best sequential algorithm are also $O(n^2/\log^2 n)$ and $O(1)$
respectively. On the other hand, if G is sparse, i.e.
$m = O(n)$, then $T(1) = O(n)$, $S(p) = O(n/\log^2 n)$ and
$E(p) = O(1/n)$. This implies that a great deal of processing
power is wasted to achieve the speedup of $O(n/\log^2 n)$.

Remarks on Algorithm MOD.CONNECT:

1. On average, it is profitable to detect early
termination of Algorithm MOD.CONNECT, especially when the
graph is dense. One stopping criterion is that no
center is "merged" in the current iteration. A
realization is suggested below:

Insert step 1.5 'Last <- n' into step 1 and the
following two steps at the end of the first for-loop.

10  Now <- Sum{Flag(x) | 1 ≤ x ≤ n}
11  if Last=Now then STOP else Last <- Now

In total, step 1.5 and step 11 require 1 and at most
log n time units respectively by using one processor;
whereas step 10 requires at most $\log^2 n$ time units by
using $\lfloor n/2 \rfloor$ processors.

2. When implementing the Clean Up step, one can contract the
rows instead of the columns of the adjacency matrix. This
ends up with the same time and processor bounds as
Algorithm MOD.CONNECT. Furthermore, rows and columns can
be contracted one after the other to gain more time
reduction for the next contraction and for finding the
smallest incident node from the second iteration onwards.
However, the tradeoff between spending more time on each
Clean Up step and less time on subsequent iterations is
insignificant.
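The early-termination test of Remark 1 can be sketched as a loop wrapper. The code below is an assumption-laden illustration: `iterate` stands for one pass of steps 2-9 and is supplied by the caller, and `halve` is a toy stand-in (not part of Algorithm MOD.CONNECT) that clears every other flagged entry so that the loop has something to run on.

```python
def run_until_stable(flag, iterate, max_iters):
    """Repeat `iterate` until the number of flagged elements stops changing."""
    last = len(flag)                      # step 1.5: Last <- n
    for it in range(1, max_iters + 1):
        flag = iterate(flag)              # one pass of the main loop body
        now = sum(flag)                   # step 10: Now <- Sum{Flag(x)}
        if last == now:                   # step 11: no center was merged
            return flag, it
        last = now
    return flag, max_iters

def halve(flag):
    """Toy iteration: keep the first, third, fifth, ... flagged entries."""
    out, keep = [], True
    for f in flag:
        if f:
            out.append(1 if keep else 0)
            keep = not keep
        else:
            out.append(0)
    return out

final_flag, iterations = run_until_stable([1] * 8, halve, max_iters=10)
```

Starting from eight flagged elements the count goes 4, 2, 1, 1, so the loop stops in the fourth iteration instead of running all ten.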


3.1.4 Applications of Algorithm MOD.CONNECT

The following paragraphs give brief descriptions of how
Algorithm MOD.CONNECT is applied to other related problems,
such as finding all spanning trees, finding the minimum
spanning tree and the transitive closure of an undirected
graph; they are intended simply to show the modification of
Algorithm MOD.CONNECT accordingly and hence, merely the
complexities of the additional and modified steps are given.

Spanning Trees

Generating all spanning trees of a connected,
undirected graph G has played an important role in many
parallel graph problems. For instance, in finding the
biconnected components of G, the first step is to generate
all spanning trees of G [37 p.63-64]. Consider what happens
in Algorithm MOD.CONNECT. The algorithm basically generates
a spanning tree for each connected component by using
uniform depth-first search. However, the edges in each
spanning tree are not recorded. This can be efficiently
implemented by using three vectors Col, Head and Tail where
Head(x), Tail(x) ∈ V and the set {(Head(x), Tail(x)) | Flag(x) ≠ 1
for 1 ≤ x ≤ n} is the set of edges in all spanning trees. Also,
the column contraction in Algorithm MOD.CONNECT will destroy
the correct column indices. As a result, the selected edges
cannot be correctly represented by the matrix indices. To
remedy this problem, the adjacency matrix is modified by
resetting the connected edges in each row i of A with the
corresponding column indices and substituting 'Min' for 'OR'
in step 8. The realization of the modified Algorithm
MOD.CONNECT, called Algorithm SPAN.TREE, is shown in Figure
5.

Resetting A (step 1e) requires $O(\log^2 n)$ time units with
$\lceil n^2/\log^2 n \rceil$ processors; initializing Head and Tail (steps 1c
and 1d), and updating Head (step 3d) require O(1) time units
with n processors; updating Col and Tail (steps 2b and 3c)
require at most $O(\log^2 n)$ time units with $n\lceil n/\log^2 n \rceil$

Algorithm SPAN.TREE


     begin
1      ∀x do comment: Initialization
1a       begin D(x) <- x
1b             Flag(x) <- 1
1c             Head(x) <- 0
1d             Tail(x) <- 0
1e             if A(x,y)=1 then A(x,y) <- y
         end
       for i=1 until log n do
         begin comment: Uniform Smallest Incident Node Selection
2          ∀x,y s.t. {D(y)≠D(x) AND Flag(y)=1 AND A(x,y)≠0} do
2a           begin C(x) <- Min{D(y)} if none then D(x)
2b                 Col(x) <- Min{y | D(y)=C(x)}
             end
3          ∀x s.t. Flag(x)=1 do
3a           begin ∀y s.t. {D(y)=D(x) AND C(y)≠D(x)} do
3b               begin C(x) <- Min{C(y)}
3c                     Tail(x) <- Min{y | C(y)=C(x)}
3d                     Head(x) <- A[Tail(x), Col(Tail(x))]
                 end
4              if C(x)=D(x) then Flag(x) <- 0
               comment: Path Compression
5              D(x) <- Min{D(x), D[C(x)]}
6              for j=1 until log(n-1) do
6a               begin C(x) <- C[C(x)]
6b                     D(x) <- D[C(x)]
                 end
             end
7          ∀x s.t. Flag(x)=0 do D(x) <- D[D(x)]
           comment: Clean Up (by column contraction)
8          ∀x,y s.t. y=D(y) do
8a           ∀z s.t. Flag(z)=1 do
8b             A(x,y) <- Min{A(x,z)≠0 | D(z)=D(y)}
9          ∀x do if D(x)≠x then Flag(x) <- 0
         end
     end

Figure 5: Algorithm SPAN.TREE

processors. Hence, the changes do not affect the upper time
and processor bounds of Algorithm MOD.CONNECT (Corollary
3.9) and thus Algorithm SPAN.TREE generates all spanning
trees of G in $O(\log^2 n)$ time units with $n\lceil n/\log^2 n \rceil$
processors.

Minimum  Spanning  Tree 

Given a weighted, connected and undirected graph G, it
is often of interest to determine a spanning tree of minimum
total edge weight, i.e., such that the sum of the weights of
all edges in the tree is minimum. Such a tree is called a
minimum spanning tree (MST). In [28], Levitt and Kautz
implemented Sollin's algorithm [7 p.189] on their cellular
array and yielded the first parallel algorithm for
determining the MST. By using a transitive closure algorithm
to find the spanning trees, Csanky [13] produced an
$O(\log^3 n)$ algorithm with $n^{2.81}$ processors. By modifying
Algorithm CONNECT to find spanning trees, Savage [37 p.42-45]
reduced the time to $O(\log^2 n)$ with $n\lceil n/\log n \rceil$ processors.
The processor bound of her method can be further improved to
$n\lceil n/\log^2 n \rceil$ by employing Algorithm SPAN.TREE with a similar
modification in selecting the smallest incident node. That
is, node j is selected to be the smallest incident node for
node i if edge (i,j) has the minimum weight among all the
edges emanating from node i. Moreover, if there is a tie,
then the edge with the smallest node label is selected, as
before. This minor modification will merely double the time
spent in finding the smallest incident node in Algorithm
SPAN.TREE. This is summarized in the corollary below.

Corollary 3.10: The MST of a weighted, connected and
undirected graph with n nodes can be determined in $O(\log^2 n)$
time units by using $n\lceil n/\log^2 n \rceil$ processors.
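The modified selection rule (minimum edge weight first, smallest node label on ties) can be sketched in Python. The sketch covers only the selection for a single node in the first iteration, before any centers have merged; the weight-matrix convention (`None` for a missing edge), the 0-based labels and the name `smallest_incident` are assumptions made for illustration.

```python
def smallest_incident(W, i):
    """Pick the neighbour j of node i whose edge (i, j) has minimum weight,
    breaking ties by the smallest node label. Returns i itself when node i
    has no incident edge."""
    candidates = [(W[i][j], j) for j in range(len(W))
                  if j != i and W[i][j] is not None]
    # Tuples compare by weight first and by label second, which is the tie rule.
    return min(candidates)[1] if candidates else i

W = [[None, 4, 4],
     [4, None, 1],
     [4, 1, None]]
choice = [smallest_incident(W, i) for i in range(3)]
```

Node 0 sees two edges of weight 4 and therefore picks the smaller label, node 1; nodes 1 and 2 pick each other through the edge of weight 1.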

Transitive  Closure  of  Undirected  Graphs 

Once the connected components of an undirected graph G
are found, the transitive closure A" of G can be easily
obtained with an additional comparison, i.e., A"(i,j) = 1 if
and only if i and j belong to the same connected component.
This additional comparison can be done in $\log^2 n$ time units
by using $\lceil n^2/\log^2 n \rceil$ processors. Hence, with this additional
step, Algorithm MOD.CONNECT finds the transitive closure of
an undirected graph with n nodes in $O(\log^2 n)$ time units by
using $n\lceil n/\log^2 n \rceil$ processors. This improves on Savage's
$O(\log^2 n)$ algorithm with $n\lceil n/\log n \rceil$ processors [37 p.50].
Thus, we have the following corollary.

Corollary 3.11: The transitive closure of an n×n
symmetric Boolean matrix can be determined in $O(\log^2 n)$ time
units using $n\lceil n/\log^2 n \rceil$ processors.
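The additional comparison is simple enough to state directly in code. The sketch below assumes the component labels are available as a list D in which D[i] is the representative of node i (0-based); the function name `closure_from_components` is introduced here for illustration.

```python
def closure_from_components(D):
    """A"(i,j) = 1 if and only if nodes i and j carry the same component label."""
    n = len(D)
    return [[1 if D[i] == D[j] else 0 for j in range(n)] for i in range(n)]

D = [0, 0, 2, 2]                  # two components: {0, 1} and {2, 3}
closure = closure_from_components(D)
```

Each of the n² entries depends on one comparison only, which is why the step parallelizes so easily.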

This method, unfortunately, cannot be extended to
reduce  the  processor  bound  for  the  problem  of  finding 
all-pairs  shortest  paths. 




3.2 Some Connectivity Problems in Directed Graphs

In  directed  graphs,  the  properties  involving  paths, 
cycles,  and  connectivity  become  more  complicated  than  in 
undirected  graphs  because  of  the  edge  orientations.  In  this 
section,  we  turn  our  attention  to  directed  graphs  and  see 
whether  we  can  get  good  upper  bounds  for  the  problems  of 
finding  different  connected  components. 

Weakly  Connected  Components 

The weakly connected components of a directed graph are
easily obtained by ignoring the edge directions and removing
duplicate edges, and then using Algorithm MOD.CONNECT to
find the connected components of the resulting undirected
graph. The conversion of the adjacency matrix can be done in
$\log^2 n$ time units with $\lceil n^2/\log^2 n \rceil$ processors. Thus, finding
the weakly connected components of an n-node directed graph
takes at most $O(\log^2 n)$ time units with $n\lceil n/\log^2 n \rceil$
processors. The algorithm is as follows:

Algorithm WEAK.CON

1. ∀i,j A(i,j) <- A(i,j) OR A(j,i)

2. Find the connected components by Algorithm MOD.CONNECT
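The two steps of WEAK.CON can be sketched as follows. Step 1 is the symmetrization from the algorithm; for step 2 the sketch substitutes an ordinary sequential depth-first labelling for Algorithm MOD.CONNECT, so it illustrates the reduction rather than the parallel method. The name `weak_components` and the 0-based labels are assumptions.

```python
def weak_components(A):
    """Label every node with the smallest node of its weakly connected component."""
    n = len(A)
    # Step 1: A(i,j) <- A(i,j) OR A(j,i), i.e. ignore edge directions.
    U = [[A[i][j] | A[j][i] for j in range(n)] for i in range(n)]
    # Step 2 (sequential stand-in for MOD.CONNECT): depth-first labelling.
    D = [None] * n
    for start in range(n):
        if D[start] is not None:
            continue
        D[start] = start
        stack = [start]
        while stack:
            x = stack.pop()
            for y in range(n):
                if U[x][y] and D[y] is None:
                    D[y] = start
                    stack.append(y)
    return D

# Edges 0->1 and 2->1; node 3 is isolated.
A = [[0, 1, 0, 0],
     [0, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 0, 0]]
labels = weak_components(A)
```

Nodes 0, 1 and 2 end up in one weak component even though no directed path joins 0 and 2.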

Strongly Connected Components

Now consider the problem of finding the strongly connected
components of a directed graph with n nodes. Based on matrix
multiplication for computing the reflexive transitive
closure A" of the adjacency matrix A, Arjomandi [2] presented an
algorithm for finding the strongly connected components of
an n-node directed graph in $O(\log^2 n)$ time units with $n^3$
processors and $2n^2 + n$ storage. By using Chandra's parallel
version of Strassen's algorithm to compute A", we propose a new
Algorithm STRONG.CON below which requires $O(\log n\,(n^{2.81})/P)$
time units with $P \le n^{2.81}/\log n$ processors and $n^2 + n$
storage.

Algorithm STRONG.CON

1. Find A", the reflexive transitive closure of A.

2. A"(i,j) <- A"(i,j) AND A"(j,i)

3. Index(i) <- the position of the first nonzero entry in
   row i

In Algorithm STRONG.CON, A requires $n^2$ storage.
Employing Chandra's algorithm to compute A" takes
$O(\log n\,(n^{2.81})/P)$ time units with $P \le n^{2.81}/\log n$
processors. Since row i of A marks all the reachable pairs
of nodes from node i, step 2 only removes the marks of the
pairs of nodes that are not mutually reachable from A, which
takes one time unit with $n^2$ processors. Step 3 uses the
smallest-labelled node in each strongly connected component
to identify the corresponding component, which takes log n
time units with $n^2$ processors and n extra storage. (Notice
that due to the symmetry of A", steps 2 and 3 can be
performed only on the lower triangular matrix of A" without
affecting the result.) Hence, the total time is
$O(\log n\,(n^{2.81})/P)$ time units with $P \le n^{2.81}/\log n$
processors. The result is summarized as follows.

Corollary 3.12: Algorithm STRONG.CON determines the
strongly connected components of an n-node directed graph in
$O(\log n\,(n^{2.81})/P)$ time units using $P \le n^{2.81}/\log n$
processors.
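The three steps of STRONG.CON can be traced with a short sequential sketch. Note the substitution: the closure is computed here with Warshall's algorithm, not with the parallel matrix-multiplication method the text relies on, so only the logic of steps 2 and 3 is being illustrated. Function names and 0-based node labels are assumptions.

```python
def reflexive_transitive_closure(A):
    """Warshall's algorithm; a sequential stand-in for step 1."""
    n = len(A)
    R = [[1 if i == j else int(bool(A[i][j])) for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            if R[i][k]:
                for j in range(n):
                    if R[k][j]:
                        R[i][j] = 1
    return R

def strong_components(A):
    n = len(A)
    R = reflexive_transitive_closure(A)                              # step 1
    M = [[R[i][j] & R[j][i] for j in range(n)] for i in range(n)]    # step 2
    return [row.index(1) for row in M]                               # step 3

# Edges 0->1, 1->0, 1->2, 2->3, 3->2.
A = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 0, 0, 1],
     [0, 0, 1, 0]]
index = strong_components(A)
```

Because the closure is reflexive, every row of the masked matrix has a nonzero diagonal entry, so the first nonzero position in row i is always defined and equals the smallest-labelled node of the component containing i.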

Unilateral  Connectivity 

In the case of finding the unilaterally connected
components of an n-node directed graph G, Arjomandi [3]
showed that the number of unilaterally connected components
can grow exponentially with n. This implies that either
exponential time or exponentially many processors are
required for any parallel algorithm (otherwise the open
question P = NP would be settled), which is beyond the scope
of this thesis. Nevertheless, if the problem is limited to
verifying unilateral connectivity, we construct a parallel
algorithm below which solves this problem in
$O(\log n\,(n^{2.81})/P)$ time units with $P \le n^{2.81}/\log n$
processors.

Algorithm UNI.CON

1. Find A", the reflexive transitive closure of A.

2. A"(i,j) <- A"(i,j) OR A"(j,i)

3. if A"(i,j)=1 ∀i,j then return 'Unilaterally connected'
   else return 'Non-unilaterally connected'

Step 1 computes all the reachable nodes from all nodes,
which can be done in $O(\log n\,(n^{2.81})/P)$ time units with
$P \le n^{2.81}/\log n$ processors using Chandra's algorithm. By
definition, if G is unilaterally connected, there must be at
least one path between each pair of nodes. This is simulated
in steps 2 and 3 using one and 2 log n time units, and $n^2$ and
$\lfloor n^2/2 \rfloor$ processors respectively.

Hence, Algorithm UNI.CON takes $O(\log n\,(n^{2.81})/P)$ time
units with $P \le n^{2.81}/\log n$ processors.
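A sequential sketch of UNI.CON follows. As an explicit substitution, Warshall's algorithm stands in for the parallel closure computation of step 1; steps 2 and 3 are folded into a single test. The function name `is_unilaterally_connected` and the 0-based labels are assumptions.

```python
def is_unilaterally_connected(A):
    """True when, for every pair (i, j), i reaches j or j reaches i."""
    n = len(A)
    # Step 1 (stand-in): reflexive transitive closure by Warshall's algorithm.
    R = [[1 if i == j else int(bool(A[i][j])) for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            if R[i][k]:
                for j in range(n):
                    if R[k][j]:
                        R[i][j] = 1
    # Steps 2 and 3: OR each entry with its transpose and test that all are 1.
    return all(R[i][j] or R[j][i] for i in range(n) for j in range(n))

chain = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]     # 0 -> 1 -> 2
fork  = [[0, 1, 1], [0, 0, 0], [0, 0, 0]]     # 0 -> 1 and 0 -> 2
```

The chain is unilaterally connected, while in the fork neither of nodes 1 and 2 can reach the other.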

Acyclicness 

The problem of verifying acyclicness of a directed
graph G can be reduced to computing the transitive closure
A" of G. If all the diagonal elements of A" are zero, it
implies that there exists no cycle in G. The algorithm is as
follows.

Algorithm Acyclic

1. Find A", the transitive closure of G

2. if A"(i,i)=0 ∀i then return 'Acyclic'
   else return 'Cyclic'

Step 1 can be done in $O(\log n\,(n^{2.81})/P)$ time units
using $P \le n^{2.81}/\log n$ processors. Step 2 takes log n time
units with $\lfloor n/2 \rfloor$ processors. Hence, Algorithm Acyclic takes
$O(\log n\,(n^{2.81})/P)$ time units using $P \le n^{2.81}/\log n$
processors.
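The acyclicity test can be sketched in the same way. Here the closure is the plain (non-reflexive) transitive closure, again computed with Warshall's algorithm as a sequential substitute for the parallel method; the function name `is_acyclic` is an assumption.

```python
def is_acyclic(A):
    """True when no node can reach itself along one or more edges."""
    n = len(A)
    # Step 1 (stand-in): transitive closure, without adding the diagonal.
    R = [[int(bool(v)) for v in row] for row in A]
    for k in range(n):
        for i in range(n):
            if R[i][k]:
                for j in range(n):
                    if R[k][j]:
                        R[i][j] = 1
    # Step 2: the graph is acyclic when every diagonal entry is zero.
    return all(R[i][i] == 0 for i in range(n))

chain = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]     # 0 -> 1 -> 2
cycle = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]     # 0 -> 1 -> 2 -> 0
```

The diagonal must not be set during initialization; with a reflexive closure every graph would look cyclic.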
CHAPTER 4


CONCLUSION 

It has been demonstrated that under an idealized model
of parallel computation, graph problems can be solved
efficiently by representing graphs as adjacency matrices.
Two optimal algorithms were presented for computing
A(n) = a(1) ⊕ a(2) ⊕ ... ⊕ a(n), whose time bounds were proven to
be equal to the theoretical lower time bounds in both
bounded and unbounded parallelism, where ⊕ is any
associative binary operation. This result, together with the
exploitation of the reducibility of the problem size by at
least half after each iteration, facilitates the reduction
of processor requirements by a factor of log n over the
existing algorithms for a set of graph problems that can be
reduced to the problem of finding the connected components
of an n-node undirected graph. Moreover, the number of
processors needed to execute each algorithm is optimally
utilized (i.e., E(P) = O(1)).

In general, the technique developed in Section 3.1.3
for achieving the processor reduction mentioned above can be
applied analogously to reduce the processor requirement for
any parallel algorithm in which the problem size is reduced
by at least half after each iteration. Furthermore, if a
problem's size can be reduced by a factor of the square root
of n after each iteration, the $O(\log^i n)$ time requirement can
be improved to $O(\log^{i-1} n)$ by a similar technique, where
i > 1. An algorithm possessing this property still awaits
discovery, however.

As pointed out in Section 2.2.2, the processor
requirements of many existing parallel graph algorithms were
shown to be reducible by choosing the current best matrix
multiplication algorithm (i.e. Chandra's parallel version of
Strassen's algorithm). By using Chandra's algorithm to
compute the transitive closure of an n-node directed graph,
efficient parallel algorithms were formulated for detecting
the existence of negative cycles (in Section 2.2.2), finding
the strongly connected components, verifying unilateral
connectivity and acyclicness (in Section 3.2), each using
$O(\log n\,(n^{2.81})/P)$ time units with $P \le n^{2.81}/\log n$
processors, $S(P) = O(P/\log n)$ and $E(P) = O(1/\log n)$.

The construction of an O(log n) parallel algorithm for
matrix multiplication using fewer than $O(n^{2.81}/\log n)$
processors remains an open problem whose solution would
imply improvements on the corresponding sequential algorithm
and many other matrix-oriented problems.

The results in this thesis should provide a better
understanding of the relationship of programs to machine
organization, offering new insights into the design of
practical parallel computers.